DIFFERENCES WITHIN AND BETWEEN 
COMMUNITIES IN THE INTELLIGENCE 
OF THE CHILDREN¹ 


EDWARD L. THORNDIKE AND ELLA WOODYARD 


Teachers College, Columbia University 








We have records of the scores of sixth-grade pupils from thirty 
cities in the National Intelligence Test, parts A and B combined. All 
the testing was done by Dr. Woodyard at various times during the 
school year 1939-1940. Except for absences and for possible errors 
on the part of the school officers of the cities, the children in any city 
comprise the entire public-school sixth-grade population, and none 
outside it. We have also a record of the age of each child at the time 
of the test. 

It would have been better to have measured all the children of a 
given age, but this would have involved much labor of school officers, 
and a considerable disturbance of the school programs. Moreover, 
our reports back to the teachers would have been of much less interest 
to them. Consequently, our request for permission to test was 
restricted to one grade, and accompanied by our promise to report 
the facts for every pupil in that grade. 


DIFFERENCES WITHIN A COMMUNITY 


The differences between individuals within the same community 
are very great. To include fifty per cent of the sixth-graders in one 
of these cities requires a spread of from forty to sixty points on the 
test scale in a city with few Negro pupils, and from sixty to ninety 
points in a city with many Negroes. Sixty points equals approximately 
the difference between the average score in grade V and that in grade 
VII, between the average for age eleven and that for age thirteen, or 
between the average for age nine and that for age eleven. It is one- 





1 The investigation reported here was made possible by financial support from the Carnegie Corporation. 
TABLE I.—VARIATION WITHIN GRADE VI IN SCORE IN NATIONAL INTELLIGENCE TEST, FORM A PLUS FORM B, IN SIX CITIES 
Cities A and B are the two with least variation among our thirty 
Cities C and D are the two with most variation among our thirty 
City E is near the median for northern cities 
City F is near the median for southern cities 
The minimum and maximum attainable scores are 0 and 358 

[Table body: frequency (in per cents) for each city in each ten-point score interval, from 30 to 39 up to 320 to 329. The frequency entries are not legible in this copy.] 
sixth of the difference between zero and the maximum score attainable 
(358; 152 for Form A and 206 for Form B). 

Table I shows the variation in six cities: the two in which it is least, 
the two in which it is greatest, the one nearest the median of northern 
cities in this respect, and the one nearest the median of southern cities 
with at least one-sixth as many Negro children as white in grade VI. 

If we had measurements of each pupil by a large number of tests 
given at various times the variability of the grade population in a city 
would be reduced, but even with perfect measures of intelligence, there 
would still be nearly a quarter of the population in grade VI who were 
above the median of grade VII or below the median of grade V. In 
spite of all that has been said concerning grading by ability, pro- 
motion as an annual or semiannual routine by the edict of the teacher 
is still potent. The edicts of teachers are still issued with little con- 
sideration of the intellectual abilities of the pupils. It is our impres- 
sion that the variability within grade VI in one of these cities twenty 
years ago was not very much greater than it is now. 

In all schools of two cities of the thirty, and in some schools in two 
other cities, there is a grouping into classes within grade VI according 
to intelligence, permitting some modification of the curriculum to fit 
ability. This could reduce greatly the worst evils of coarse grading. 
The work of the superior sixth-grade class could be really more 
advanced than that of the inferior or average seventh-grade class, and 
the work of the inferior sixth-grade class could be really less advanced 
than that of some of the fifth-grade classes. But such arrangements 
are very rare, as is shown in a later section of this report. 

There is no evidence that the cities that grade more closely than 
others are in general superior in their school work, or in the quality 
of their residents, or in their provisions for general welfare. The corre- 
lations with the Thorndike P score, and the Thorndike G score show in 
fact a slight tendency in the opposite direction. Whatever be the 
merits of reducing the variation in ability within a grade by rapid 
promotion of the bright and long retention of the dull, the cities that 
rate highest do not practice it more than the cities that rate lower. 

Judging a practice by its affiliations is in general inferior to judging 
it by its consequences. The use of quinine for malaria was meritorious 
at the time when the Paris Academy of Medicine declared against it. 
But arguments from affiliations deserve some weight; and the reasons 
why good cities give so little consideration to intellectual ability in 
promoting pupils should be studied. 
THE CLASSIFICATION OF THE SIXTH-GRADE PUPILS WITHIN THE SAME 
SCHOOL BUILDING 


In the thirty cities there are seventy-one buildings having only one 
sixth-grade class, seventy buildings having two classes, twenty buildings 
having three, and eight buildings having four or more.¹ There 
is only one city which has no building with more than one sixth-grade 
class, and which consequently could not classify by ability except by 
moving children to a remote building. Of the other twenty-nine cities 
two clearly classify by intellectual ability. The median scores in our 
test for three pairs of classes, in three buildings in one of these cities 
were 166 and 211, 152 and 221, and 177 and 223. The other of its 
buildings had four classes with medians of 135, 158, 197 and 224. The 
second city had classes with medians as follows: 

Building A, two classes, 168½, 219½ 

Building B, two classes, 180½, 232 

Building C, three classes, 144, 164, and 200 

Building D, three classes, 150, 173½, and 217. 

A third city in the three of its buildings which had two sixth-grade 
classes showed medians of 178 and 240, 190½ and 233½, 200 and 232, 
the ages of the class higher in intelligence score being a little younger 
than those in the class lower in intelligence. 

A fourth city showed classification, but of a rather imperfect sort, 
in five of the eight buildings which had two or more sixth-grade classes. 

A fifth city which has four buildings with three sixth-grade classes 
each, has one section of duller pupils in each of these buildings, but the 
selection is imperfect, omitting many of the lowest scoring third. This 
city has one building with four classes in which there was probably 
some consideration of intellect, but the classification is very imperfect 
from that point of view. The lowest class has one-sixth of its pupils 
scoring over 200, and the highest has one-sixth of its pupils scoring 
under 180. The medians of the highest and lowest are only thirty-two 
points apart, whereas they would have been sixty-nine points apart if 
selected by intelligence scores. 

A sixth city has, in a building with three classes, one class chosen 
for low intellect, but in four buildings containing two classes each pays 
little or no regard to intellect high or low in its classification. 

Twenty-three of the thirty cities ignore or reject classification by 
intellectual ability. The five cities which use it are a little, but not 
much, above the average of the thirty in the indices of education, the 
personal qualities of their residents, and the general goodness of life 
for good people. 

1 These figures do not include schools for Colored pupils only. 


DIFFERENCES BETWEEN COMMUNITIES 


The median score for each city is shown in column 3 of Table II. 
These median scores must however be corrected to allow for differences 
in the time of year of the testing, and for differences in the age com- 
position of the groups apart from those caused by differences in the 
time of testing. 

The ideal way to do this is to correct each child’s score to what he 
would presumably have attained if tested at some one age (say, 12 
years, 0 months), but this is extremely laborious. Results probably 
adequate for our purpose can be obtained by correcting the median 
score of each of the thirty cities by the difference of the median age 
of the children tested at the time of the test from 12 years 0 months. 
This difference will measure approximately the combined influence of 
the difference in age distribution at the same point (e.g. the middle) 
of the school year, and of the difference in the time of taking the test. 

From age ten and one-half years to age thirteen and one-half years 
the average white child in the United States gains about thirty points 
per year, the facts reported in the National Intelligence Tests Manual 
of Directions being as follows: 


                            SCALE A    SCALE B    SUM A + B 

Age 10½ ..................     76         80         156 
Age 11½ ..................     93         96         189 
Age 12½ ..................    109        110         219 
Age 13½ ..................    124        123         247 
Gain from 10½ to 11½ .....                            33 
Gain from 11½ to 12½ .....                            30 
Gain from 12½ to 13½ .....                            28 

We, therefore, multiply by 30 the fraction of a year by which the 
median age of the sixth-grade pupils tested in a city at the time of the 
test differs from 12.0 years, and add the product to, or subtract it 
from, the median score of the city according as the median age is below 
or above 12.0. The corrected median score for each city appears in 
column 4a of Table II. The difference between any city and the 
median or average of all thirty cities in this corrected median score 
will be roughly the same as the corresponding difference if all twelve- 


TABLE II.—OBTAINED MEDIAN SCORE AND CORRECTED MEDIAN SCORE 
AS OF AGE 12 YEARS 0 MONTHS FOR EACH CITY 

Columns: (1) city; (2) date of tests; (3) obtained median; (4a) and (4b) 
median score corrected for differences in pupils' ages and the time of year, 
for all pupils and for white pupils; (5) standard error of the median; 
(6a) and (6b) quotient reached or exceeded by 2 per cent, for all pupils and 
for white pupils. Entries marked [illegible] cannot be read in this copy. 

 (1)     (2)       (3)       (4a)        (4b)     (5)    (6a)     (6b) 
City    Date    Obtained     All        White    S.E.    All     White 
                 median     pupils      pupils          pupils   pupils 

  1   10- 9-39   162.3    [illegible]             3.8    141 
  2   10-11-39   186.9    [illegible]             3.2    153 
  3   10-17-39   181.4    [illegible]             3.1    154 
  4   10-20-39   206.4    [illegible]             2.6    164 
  5   10-23-39   210.0    [illegible]             3.1    166 
  6   10-25-39   210.4    [illegible]             3.2    163 
  7   10-30-39   205.0    [illegible]             4.3    182 
  8   12-13-39   205.9    [illegible]             3.7    162 
  9   12-18-39   205.7    [illegible]             3.1    157.5 
 10   12-22-39   203.3    [illegible]             2.3    152.5 
 11    1- 9-40   219.7    [illegible]             3.4    161.5 
 12    1-15-40   200.5    [illegible]             4.0    156 
 13    1-18-40   216.4    [illegible]             2.7    165 
 14    1-22-40   222.4    [illegible]             2.9    162.5 
 15    1-26-40   210.1    [illegible]             2.7    156 
 16    3- 1-40   213.0      210.2       216.6     2.2    168      175 
 17    3- 5-40   185.6      170.6       196.8     2.1    148      163 
 18    3- 8-40   199.6      188.1       208.1     2.2    148      154 
 19    3-12-40   193.2      177.7       192.7     2.1    142.5    147 
 20    3-13-40   193.1      178.8       197.4     2.4    148      150 
 21    3-21-40   204.9      185.8       201.4     2.2    148      153 
 22    3-26-40   208.7      191.4       199.7     2.6    150      153.5 
 23    3-29-40   188.1      173.1       187.0     2.4    149      152 
 24    4- 1-40   197.5      180.0       196.0     2.3    150.5    154 
 25    4- 3-40   192.3      169.3       189.2     2.3    139      148 
 26    4- 5-40   190.8      167.3       188.7     2.7    147      150 
 27    4- 9-40   201.8      188.7       221.7     2.6    149.5    153 
 28    4-11-40   179.2      169.2       187.9     3.4    152      154 
 29    4-22-40   212.2      196.9       198.1     3.4    149.5    149.5 
 30    4-25-40   197.6      197.4       196.7     2.5    153      153 
year-olds had been used throughout instead of the sixth-grade children 
tested, provided that the sixth-grade children tested represent all 
sixth-grade children fairly or with identical errors as to the absence of 
feeble-minded in institutions, in grades below VI, or in private schools, 
the absence of intellectually gifted children in private schools, etc. 
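The age correction described above can be sketched in a few lines. This is only an illustration of the stated rule (thirty points per year of difference from age 12.0); the function name and the example city are hypothetical and do not come from the records.

```python
def corrected_median(median_score, median_age_years,
                     reference_age=12.0, gain_per_year=30.0):
    """Adjust a city's median score to the reference age.

    A city whose pupils are older than the reference age has points
    subtracted; a city whose pupils are younger has points added.
    """
    return median_score + gain_per_year * (reference_age - median_age_years)


# Hypothetical city: obtained median 200.0, median age 12.5 years.
print(corrected_median(200.0, 12.5))   # 185.0
# Hypothetical city: obtained median 190.0, median age 11.75 years.
print(corrected_median(190.0, 11.75))  # 197.5
```

The correction is linear in age, which follows the text's use of a single rate of thirty points per year over the whole range concerned.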

A rough measure of the quantitative variation in the sampling is 
had by computing the fraction which the number of test records is of 
the number of children ten to fourteen years of age in 1940 for each 
city. The facts are as follows: 





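Frequency of cities at each value of the ratio (number of records ÷ 
population 10-14), shown separately for northern cities and for southern 
cities using whites only. 

Values of the ratio occurring: .09, .10, .12, .13, .14, .15, .16, .17, .18, 
.19, .20, .21, .22, .27. [The frequencies for each value are not legible in 
this copy.] 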











The variation is thus unexpectedly and regrettably large. There is, 
however, almost no relation between the percentage of the ten- to 
fourteen-year population found by us in grade VI and the intelligence 
rating of the city. For all thirty cities the correlation is —.07. But 
if the northern and southern cities are measured from central tend- 
encies for the North and South, respectively, the correlation is +.09. 
Consequently it seems unlikely that any of the conclusions drawn later 
in this article would be much altered if our sampling were improved 
by the inclusion of pupils in parochial and other private schools. The 
final test of their validity will be the measurement of a perfectly 
random sampling of children from each city. 

Because of the great variability within each city, the unreliability 
of the average or median score for a city is appreciable even when there 
are several hundred sixth-grade children measured. The mean square 
error of each corrected median, computed by dividing the semi-interquartile 
range by √n and multiplying by 1.8531, ranges from 2 to 4 
points, or from 1 to 2 per cent of the National Test score, as shown in 
column 5 of Table II. Though appreciable, these unreliabilities are 
inconsiderable in comparison with the difference (67) between the 
highest-scoring and lowest-scoring cities, and can account for only a 
small fraction of the general variation among the thirty cities. 
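The computation of the error of a median just described may be written out as follows. This is a minimal sketch: the function name and the example figures (a semi-interquartile range of 25 points and 400 pupils) are hypothetical. The multiplier 1.8531 is taken as printed; the usual normal-theory constant, 1.2533/0.6745, is about 1.858, so the two differ only slightly.

```python
import math


def median_standard_error(semi_interquartile_range, n, factor=1.8531):
    """Error of a median: factor * Q / sqrt(n), as described in the text."""
    if n <= 0:
        raise ValueError("n must be positive")
    return factor * semi_interquartile_range / math.sqrt(n)


# Hypothetical city: Q = 25 points, n = 400 sixth-grade pupils.
print(round(median_standard_error(25.0, 400), 2))  # 2.32
```

A result of this size falls inside the range of 2 to 4 points reported in column 5 of Table II.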

We have computed for many pupils in each city a quotient, 

    (100 × score obtained by the pupil) / (score expected of the median person of his age). 

These quotients are not strictly comparable with IQ's, but would 
correlate almost perfectly with them, and can be transformed into 
them by a suitable table. The quotient reached or exceeded by two 
per cent of the pupils in grade VI is recorded for each city in column 
6a of Table II. There is a fairly continuous range from 141 to 168, 
and one record of 182. This last is not a chance result, for the city in 
question has its 97 percentile at 173, and its 96 percentile above 167. 
It is the home of a large university. Apart from this one city the corre- 
lation between the 98 percentile score and the median score (corrected) 
is .86; including it, the correlation is .81. 
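The quotient defined above can be illustrated with the age norms quoted earlier (156, 189, 219 and 247 for Scales A and B combined). Two things in this sketch are assumptions and not statements of the article: that those four norms belong to ages 10.5, 11.5, 12.5 and 13.5, and that the expected score between those ages may be found by straight-line interpolation. The function names are hypothetical.

```python
# Assumed pairing of ages with the quoted norms (sum of Scales A and B).
NORMS = [(10.5, 156.0), (11.5, 189.0), (12.5, 219.0), (13.5, 247.0)]


def expected_score(age):
    """Score expected of the median person of a given age (interpolated)."""
    if not NORMS[0][0] <= age <= NORMS[-1][0]:
        raise ValueError("age outside the range of the quoted norms")
    for (age_lo, score_lo), (age_hi, score_hi) in zip(NORMS, NORMS[1:]):
        if age_lo <= age <= age_hi:
            fraction = (age - age_lo) / (age_hi - age_lo)
            return score_lo + (score_hi - score_lo) * fraction


def quotient(score, age):
    """100 times the pupil's score over the score expected at his age."""
    return 100.0 * score / expected_score(age)


print(expected_score(12.0))    # 204.0
print(quotient(255.0, 12.0))   # 125.0
```

Under these assumptions a pupil of exactly twelve years who scores 255 has a quotient of 125.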


CORRELATIONS OF THE INTELLIGENCE SCORES WITH OTHER 
CHARACTERISTICS OF THE THIRTY CITIES 


The thirty cities were chosen from the one hundred fifty-nine 
having from twenty thousand to thirty thousand residents in 1930. 
For each city we have three indices: 

G, an index of the general goodness of life for good people in the 
city; P 144, an index of the personal qualities of the city’s residents; 
and I, an index of the per capita income of the city’s residents. 

For twenty-six of the thirty cities, G, P and I are computed from 
the items listed in the table on pages 649-650. For the other four, the 
G, P and I scores are estimated from records that lack some of these 
items. 
CONSTITUENTS OF THE G SCORE OR INDEX 

                                                                APPROXIMATE 
                                                                  WEIGHT 
Items of Health 
  Infant death-rate reversed ..................................... 12 
  General death-rate reversed ..................................... 9½ 
  Typhoid death-rate reversed ..................................... 5 
  Appendicitis death-rate reversed ................................ 4 
  Puerperal diseases death-rate reversed .......................... 4 

Items of Education 
  Per capita public expenditures for teachers’ salaries .......... 6 
  Per capita public expenditures for textbooks and supplies ...... 7 
  Percentage of persons sixteen to seventeen attending schools ... 4½ 
  Percentage of persons eighteen to twenty attending schools ..... 7 
  Average salary high-school teacher .............................. 4½ 
  Average salary elementary-school teacher ........................ 3½ 

Economic and “Social” Items 
  [item illegible] ................................................ 12 
  Average wage of workers in factories ............................ 4 
  Frequency of home ownership (per capita number of homes owned) . 6 

Creature Comforts 
  Per capita domestic installations of electricity ............... 5 
  Per capita domestic installations of gas ........................ 7 
  Per capita number of automobiles ................................ 4 
  Per capita domestic installations of telephones ................ 11 
  Per capita domestic installations of radios ..................... 6½ 

Other Items 
  Per cent of literacy in the total population, aged 10 or older . 3½ 
  Per capita circulation of certain magazines ..................... 6 
  Death-rate from syphilis (reversed) ............................. 4 
  Death-rate from homicide (reversed) ............................. 3½ 
  Death-rate from automobile accidents (reversed) ................. 4½ 


CONSTITUENTS OF P 144 

  Percentage of illiteracy (reversed) ............................. [illegible] 
  Per capita number of homes owned ................................ 1½ 
  Per capita number of telephones ................................. [illegible] 
  Per capita number of deaths from syphilis (reversed) ............ 1 
  Per capita number of deaths from homicide (reversed) ............ 1 
CONSTITUENTS OF I 

                                                                APPROXIMATE 
                                                                  WEIGHT 
  Per capita number of income-tax returns of $2,500 or more 
    (average of 1930 and 1931) .................................... 15 
  Per capita number of income-tax returns of incomes exceeding 
    $5,000 (estimated from the data for counties) ................. 7 
  The average salary of high-school teachers ...................... 1½ 
  The average salary of elementary-school teachers ................ 1½ 
  Average wage in manufacturing plants ............................ 6 
  Median rental (or equivalent in case of homes owned) ............ 3 


The thirty cities are not a random selection from the one hundred 
fifty-nine, but were chosen to include some specially high in G and P, 
some specially low in G and P, and a few that were mediocre, all to be 
east of the Mississippi. The ordinary correlational techniques are 
not strictly applicable to them, and will be supplemented by others. 
The composite scores (G for the general goodness of life for good 
people, P for personal qualities of the population, and I for an index 
of per capita income) are given for each city in Table III in the form 
of deviations from the medians of the thirty cities. Table III also 
includes the median intelligence score for each city corrected for age 
differences, expressed as a deviation from the median of the thirty 
cities (Int.). 

The correlations (Pearson coefficients) of Int. with G are as follows: 

Int. with G in the thirty cities ................................... .86½ 
Int. with G in the seventeen cities in which the Negro pupils numbered 
  less than six per cent of the white pupils ....................... .84 
Int. with G in the thirty cities, using all pupils in the northern cities but 
  only white pupils in the southern cities ......................... .73 

The probable errors of these coefficients are respectively, .03, .05 and 
.06. 
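The article does not say how the probable errors were computed. A sketch under the assumption that the conventional formula of the period was used, PE = 0.6745 (1 − r²) / √n, gives figures that agree with those quoted when r is taken as .865, .84 and .73 and n as 30, 17 and 30. The function name is hypothetical.

```python
import math


def probable_error_of_r(r, n):
    """Conventional probable error of a Pearson coefficient (assumed formula)."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


for r, n in ((0.865, 30), (0.84, 17), (0.73, 30)):
    print(r, n, round(probable_error_of_r(r, n), 2))
```

The agreement is to two decimals only, so it supports the assumption without proving it.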


Since the thirty cities are not a random sample of the one hundred 
fifty-nine cities of twenty to thirty thousand population (in 1930), we 
may estimate what the correlations for the one hundred fifty-nine 
would be by other methods than the use of product moments. Let 
us observe the regression lines of the intelligence score (Int.) on G. 
The distribution of G scores for the one hundred fifty-nine cities was 
as shown in column 2 of Table IV, having a standard deviation of 4.85. 

Our thirty cities have scores in G as shown in column 3 of Table IV. 
Consider them as forming a group of twelve clustered around —9, a 
TABLE III.—THE SCORES OF EACH CITY IN INT., G, P, AND I, 
EXPRESSED AS DEVIATIONS FROM THE MEDIANS OF THE THIRTY CITIES 

Entries in brackets are doubtful readings in this copy. 

                       Scores 
City      Int.        G         P         I 
  1       −6          0        −6        −9 
  2        1          1         4        −1 
  3       −1         −1         1         3 
  4        6          5         2         5 
  5        6          7         6         8 
  6        6          7         6        12 
  7        6         10      [8 or 9]    22 
  8        3       [5 or 8]     7         8 
  9        3          7         9         7 
 10        1         −1        −7        −4 
 11        8         15        13        23 
 12        1          0        −1        −3 
 13        8         12         7        23 
 14       [9]        13        13        13 
 15        5         10        10         8 
 16        3          0         0         0 
 17       −7         −5       −10        −4 
 18       −2         −5        −9        −6 
 19       −5         −3        −4         0 
 20       −5         −4       −11        −2 
 21       −3         −5       −17         6 
 22       −2         −6       −12       −14 
 23       −6         −7        −8       −10 
 24       −4         −5       −11       −10 
 25       −7         −6        −9        −7 
 26       −8         −5       −15        −4 
 27        2         −8       −15         0 
 28       −7         −2        −5        −4 
 29        0          5         4         4 
 30        0          0         1         1 
group of seven clustered around −4, small groups (of two and four) 
at +1 and +3, and a scattered group of five from +6 to +11 (averaging 
+8.2). The average intelligence scores for these several groups are, 
in order: 178, 197½, 209, 216 and 227½, as shown in column 4 of 
Table IV. 

TABLE IV 

               Frequencies in G          Average Int. 
Scale in G   159 cities   30 cities      of the group 
   (1)          (2)          (3)             (4) 
   −12           1            1 
   −11           2            1 
   −10           2            2 
   − 9           6            5             178.2 
   − 8           3            1 
   − 7           3            1 
   − 6           5            1 
   − 5      [illegible]       2 
   − 4           9            4             197.6 
   − 3           5            1 
   − 2      [illegible] 
   − 1          18 
     0          20 
   + 1          16            2             209.2 
   + 2          13 
   + 3          15            3 
   + 4      [illegible]       1             216.1 
   + 5           8 
   + 6           2            1 
   + 7           3            1 
   + 8           3            1 
   + 9           4            1             227.5 
   +10           4 
   +11           2            1 
We can compute what the average scores in Int. would be in cities 
scoring −9, −4, +1, +3, and +8.2 in G if correlation were perfect, if 
we can learn what the variation of the one hundred fifty-nine cities 
in Int. is. To this we have one clue—its range in our thirty cities. 
This is from 167.3 to 234.1, or 66.8. 

The range for the one hundred fifty-nine cities must be at least as 
great, and is probably somewhat greater. In an infinitely large 
sample, the standard deviation may be taken as one-sixth of the range, 
but in a series of one hundred fifty-nine the standard deviation will 
probably be more than one-sixth of the difference between the lowest 
and the highest. It may be expected to be approximately a fifth of it. 
In the G score of the one hundred fifty-nine cities, for example, it is a 
trifle over a fifth (SD = 4.85, and the difference between highest and 
lowest G scores is 23.0). If our selection of thirty caught the lowest 
and the highest cities of the one hundred fifty-nine in intelligence, the 
standard deviation of the one hundred fifty-nine cities may be set at 
one-fifth of 66.8, or 13.36. 
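The rule of thumb used here, that the standard deviation of about one hundred fifty-nine cases is roughly one-fifth of their range, can be checked against the two sets of figures given in the text. The function name is hypothetical; the numbers are those quoted above.

```python
def sd_from_range(lowest, highest, divisor=5.0):
    """Rough standard deviation taken as a fixed fraction of the range."""
    return (highest - lowest) / divisor


# Int.: range of the thirty cities, 167.3 to 234.1.
print(round(sd_from_range(167.3, 234.1), 2))  # 13.36

# G in the 159 cities: SD = 4.85 and range = 23.0, so the SD is this
# fraction of the range ("a trifle over a fifth").
print(round(4.85 / 23.0, 3))                  # 0.211
```

Dividing by 6 instead of 5, as for an infinitely large sample, would give about 11.1 for Int.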


TABLE V.—SCORES IN INT. CORRESPONDING TO VALUES OF G OF −9, 
−4, 0, +1, +3 AND +8 

                  With perfect correlation 
                                                     By the actual 
              If SD = 13   If SD = 14   If SD = 15     relation 
                 (1)          (2)          (3)            (4) 
For G = −9      180.9        179.0        177.2          178.2 
    G = −4      194.3        193.5        192.6          197.6 
    G =  0      205.0        205.0        205.0 
    G = +1      207.7        207.9        208.1          209.2 
    G = +3      213.0        213.7        214.3          216.1 
    G = +8      226.2        228.1        229.7          227.5 

















In Table V, columns 1, 2 and 3, we show the regression values of 
Int. if correlation is perfect and if the standard deviation is 13, 14, or 
15, respectively, assuming that the intelligence score for the median 
city is 205.¹ If the range of our thirty cities in Int. is less than the 
range of the one hundred fifty-nine, the standard deviation of the 
one hundred fifty-nine cities will, of course, be larger; perhaps as large 
as 15. Column 4 of Table V repeats the actual average scores in Int. 
of the groups of cities clustered around −9, −4, +1, +3 and +8. 

1 It is near that, since the average sixth-grade score for white pupils in cities 
reported in the National Intelligence Test Manual is 215 (207 plus an allowance of 
8 for the fact that our tests were given later in the school year). A moderate shift 
up or down in this assumption for the median will not weaken our argument. 

By any of these values for the SD of the one hundred fifty-nine 
cities in Int., it is clear that our correlation of .86½ is little, if any, too 
high. On the contrary, from −9 to +8 the change by a correlation 
of .865 and standard deviations of 13, 14, and 15, respectively, is 
39½, 42, and 45½ points. By the actual facts it is 49½. 
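The regression values in Table V, and the changes from −9 to +8 just quoted, follow from the ordinary regression line: expected Int. = median Int. + r × SD of Int. × G ÷ SD of G. The sketch below uses the figures given in the text (median 205, SD of G 4.85); the function name is hypothetical.

```python
SD_G = 4.85  # standard deviation of G in the 159 cities, from the text


def expected_int(g, sd_int, r=1.0, median_int=205.0, sd_g=SD_G):
    """Regression estimate of a city's Int. from its G score."""
    return median_int + r * sd_int * g / sd_g


# Perfect correlation (columns 1 to 3 of Table V).
for sd in (13, 14, 15):
    row = [round(expected_int(g, sd), 1) for g in (-9, -4, 0, 1, 3, 8)]
    print("SD =", sd, row)

# Change from G = -9 to G = +8 with r = .865.
for sd in (13, 14, 15):
    change = expected_int(8, sd, r=0.865) - expected_int(-9, sd, r=0.865)
    print("SD =", sd, "change =", round(change, 1))
```

Run as written, the sketch reproduces the printed entries of Table V to within a few tenths of a point, and gives changes of about 39.4, 42.4 and 45.5 points against the 49½ actually observed.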

As a second check, the correlation may be computed from the 
reduction of the variability in an array, using our groups of twelve 
cities around −9 in G score, seven cities around −4, four cities around 
+3, and five cities around +8 as four arrays. These are of course 
very broad arrays, and their variabilities will, therefore, tend to be 
greater than those of the very thin slices assumed by the formula 

    variability of array / variability of total distribution = √(1 − r²). 

Using 13, 14, and 15 as the standard deviation of the total distributions, 
the values of r obtained are as follows: 





                               If SD = 13   If SD = 14   If SD = 15 

Array around −9 (n = 12) ....     .78          .814         .84 
Array around −4 (n = 7) .....     .63          .696         .74 
Array around +3 (n = 4) .....     .89          .906         .92 
Array around +8 (n = 5) .....     .88          .895         .91 
Median of the four r's ......     .83          .854         .875 


Our correlation of .865 is very close to the general drift of the values 
by SD = 14 or SD = 15 for the one hundred fifty-nine cities. 
The measurements by the regression and by the reduction of variability 
together suggest that .86½ is not too high. 
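The second check rests on the relation between the variability of an array and that of the whole distribution, which gives r = √(1 − (SD of array ÷ SD of total)²). In the sketch below the array standard deviation of 8.13 is an illustrative value chosen for the example, not a figure reported in the article; the four r's used for the median are those tabulated for SD = 14. Function names are hypothetical.

```python
import math
import statistics


def r_from_array_sd(sd_array, sd_total):
    """Correlation implied by the reduction of variability in an array."""
    ratio = sd_array / sd_total
    if ratio > 1.0:
        raise ValueError("array SD exceeds total SD; formula does not apply")
    return math.sqrt(1.0 - ratio * ratio)


# Illustrative array SD of 8.13 points against total SDs of 13, 14 and 15.
for sd_total in (13, 14, 15):
    print(sd_total, round(r_from_array_sd(8.13, sd_total), 3))

# Median of the four tabulated r's for SD = 14.
print(statistics.median([0.814, 0.696, 0.906, 0.895]))
```

Because the arrays here are broad, their standard deviations are inflated and the r's obtained in this way tend to be too low, as the text notes.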

We may check the Pearson coefficient of .73 (obtained when only 
white people are used in the southern cities) by similar treatment of 
the regression and the reduction of the variability. We use 213 as the 
median score for the one hundred fifty-nine cities, and 12, 13 and 14 
as possible values of the standard deviation of the one hundred fifty- 
nine cities in the median Int. of white children (in the northern cities, 
all children) of age 12.0. The standard deviation of cities in the 
median intelligence score of white pupils will be less than that for white 
and colored. 
The expected values of the means of arrays of Int. for G = —9, 
0 and +8 for a correlation of .73 and standard deviation of 12, 13, 
and 14 in Int., using 213 as the median score for the 159 cities, are as 
follows: 


                       If SD = 12   If SD = 13   If SD = 14 

Cities −9 in G ......    196.8        195.4        194.0 
Cities  0 in G ......    213.0        213.0        213.0 
Cities +8 in G ......    227.3        228.6        229.9 


The actual values were 197.1 for G = −9 and 227.5 for G = +8.2, 
very close to the expected values for SD = 12. If SD is 13, a correlation 
of .66 gives 197.1 and 227.1, very close to the actual values. 
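The same regression line can be turned around to ask what correlation a given array mean implies for an assumed standard deviation. This sketch uses the figures of the text (median 213, SD of G 4.85, actual mean 197.1 at G = −9); the function name is hypothetical.

```python
def implied_r(actual_mean, g, sd_int, median_int=213.0, sd_g=4.85):
    """Correlation that would place the array mean exactly on the regression line."""
    return (actual_mean - median_int) * sd_g / (sd_int * g)


for sd in (12, 13, 14):
    print("SD =", sd, "r =", round(implied_r(197.1, -9, sd), 2))
```

With SD = 13 the implied coefficient is about .66, as stated; larger assumed standard deviations give lower coefficients by this method.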

Using the reduction of the variability in our groups of twelve, 
seven, four, and five cities, the median r of the four determinations, 
if the SD of the one hundred fifty-nine cities is taken as 12, is .73. If 
the SD is taken as 13, the median r is .78. If the SD is taken as 14, 
the median r is .82. 

Thus the regression determinations give r = approximately .73 
for SD = 12, and give lower correlations for higher values of SD. 
The determinations by the reduction in the variability of the arrays 
give r = approximately .73 for SD = 12, and give higher correlations 
for higher values of SD. The correlation of intelligence of the young 
residents of a city with the general goodness of life in the city is 
thus still high, after the Colored children are excluded in southern 
cities. There is reason to believe that if the Colored pupils were 
excluded from the computations in northern cities as well as southern 
the correlation of Int. with G would have been higher than .73. 

On the whole we may set the correlation between Int. and G as 
near .86 when the colored are included, and near .75 when the colored 
are excluded. We have then the striking fact of a very intimate 
association between the index (G) of the general goodness of life in 
a community and an index (Int.) of the intelligence of its children of 
age twelve. About three-quarters of the variation (.86²) in G among 
these cities of twenty to thirty thousand inhabitants seems to be 
accounted for by whatever is measured by the Int. of the children; 
and over half of it (.75²) by whatever is measured by the Int. of the 
white children alone. 
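The fractions of variation mentioned here are the squares of the two coefficients. A two-line check, with a hypothetical function name:

```python
def shared_variance(r):
    """Fraction of variation associated with a correlation of r."""
    return r * r


for r in (0.86, 0.75):
    print(r, shared_variance(r))
```

The squares are about .74 and .56, that is, roughly three-quarters and a little over one-half.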

We will not try at this time to measure how far this association is 
due to the power of intelligence in a population to make a city have the 
sort of life measured by a high G score, and how far it is due to the 
power of that sort of life to make children have high Int. scores other 
than via the genes of their parents. The evidence presented in 
Education as Cause and as Symptom (Thorndike, 1939) is relevant. 

We turn now to the correlation of Int. with P 144 (so called to 
distinguish it from the P score used in Thorndike’s Your City for cities 
of thirty thousand to five hundred thousand population in 1930). As 
stated earlier, P 144 is a weighted composite of a city’s scores in 
literacy, infrequency of deaths from syphilis, infrequency of deaths 
from homicide, frequency of home ownership, and frequency of tele- 
phone subscribers. 

The correlations of Int. with this rough measure of desirable per- 
sonal qualities in a city’s residents are as follows: 


Int. with P 144 in the thirty cities ................................ .82 
Int. with P 144 in the seventeen cities in which the Negro pupils numbered 
  less than six per cent of the white pupils ........................ .77 
Int. with P 144 in the thirty cities, using all pupils in the northern cities 
  but only white pupils in the southern cities ...................... .63 


We have checked the .82 by the facts for regression and for the 
reduction of the variability of the arrays of Int. under P 144. It 
represents the facts fairly. The correlation would presumably be 
higher if P 144 were a perfect measure of desirable personal qualities 
in a city’s residents, or even as adequate a measure as the P score 
used in Your City. The correlation of Int. with an adequate P will 
probably be as high as its correlation with G. 

We have for each city an index (I) of per-capita income computed 
as stated under the caption, “Constituents of I,” and giving the 
measures shown in the column of Table III headed I. The correlation 
of Int. with I in the thirty cities is .78 with a probable error of .05. 

Subject to amendment by further investigations the intelligence-test 
score of a community must be considered a very important symptom 
of its G (the general goodness of its life for good people), its P (the 
desirable personal qualities of a population), and its I (per-capita income). 
If correlations approaching the .86½, .82 and .78 that we have found 
are typical, we may match Terman’s claim that IQ is the greatest 
single factor in a person’s success by the claim that the average IQ 
of a community is the greatest single factor in its welfare. 





SEX DIFFERENCES IN ACHIEVEMENT 
IN THE ELEMENTARY AND SECONDARY SCHOOLS 


J. B. STROUD AND E. F. LINDQUIST 


State University of Iowa 


This article reports the data on sex differences in school achieve- 
ment yielded by the Iowa Every-Pupil Testing Program, high school, 
for the years 1932 to 1939, and the Iowa Every-Pupil Basic Skills 
Testing Program (Grades III-VIII) for the year 1940. A brief 
review of representative articles dealing with previously published 
investigations of a similar character is also given. 


A REVIEW OF THE LITERATURE 


This brief review chiefly treats sex differences in achievement test 
scores, although attention is called at the outset to sex differences in 
school marks, promotion, acceleration, retardation, and similar
evidences of school progress. In his Laggards in Our Schools, 1909,
Ayres² concluded that “our schools as they now exist are better fitted
to the needs and natures of the girl than of the boy pupils.” He
based this conclusion upon an analysis of the records of several hundred 
thousand pupils in various cities of the nation. In 7624 high schools 
in 1906-1907 there were 314,084 boys enrolled in comparison with 
419,570 girls. In the elementary schools in fifteen cities, having an 
enrollment of 282,179 pupils, he found retardation among 37.1 per 
cent of the boys and 32.8 per cent of the girls. Approximately 23 
per cent of the boys were repeating grades in comparison with 20.2 per 
cent of the girls. It is known that in recent years the number of boys 
in high school more nearly equals the number of girls. 

St. John’s¹⁵ data on retardation and acceleration make possible
sex comparisons at comparable IQ levels. His investigation deals 
with the progress, over a four-year period, of about five hundred boys 
and four hundred fifty girls, Grades I to VI, chiefly I to IV, enrolled 
in the schools in a residential suburb of Boston. Table I shows the 
sex comparisons. St. John’s data also show that correlations between 
IQ and achievement data were higher for girls than for boys. In 
marks of conduct and effort girls achieved a greater degree of superior- 
ity than in any of the other measures used. 

Johnson’s⁹ analysis of the records of the high-school pupils in St.
Louis is to the same purpose. His data are shown, in part, in Table II. 
Ayres suggested that the inferior showing of boys may result from
“over-feminization” of our schools. It is not entirely clear just
what factors he had in mind. St. John leaves no doubt as to his
position, as is seen in the following: “The consistent inferiority of the
boys in school progress and achievement is due chiefly to a
maladjustment between the boys and their teachers which is the result of
interests, attitudes, habits and general behavior tendencies of boys to
which the teachers [in his study, all women] fail to adjust themselves
and their school procedures as well as they do to the personality traits
of girls.” Johnson concurs in this opinion.


TABLE I.—PUPILS REPEATING AND SKIPPING GRADES, IN TERMS
OF PERCENTAGES OF THEIR IQ GROUPS*

         Special class and repeaters       Gaining a year
 IQ          Boys        Girls            Boys       Girls
140                                       66.7       100.0
130                                       18.2        33.3
120                                       23.3        34.5
110           3.6          2.6            11.9         9.0
100          24.4         18.8             3.0         1.8
 90          59.2         48.8             1.5         0.8
 80          88.9         86.4
 70         100.0        100.0
 60         100.0

* St. John, C. W.: Op. cit.


TABLE II.—SOME SEX COMPARISONS IN SEVEN HIGH SCHOOLS IN
ST. LOUIS*
(For One Semester)

DATA                                                 BOYS      GIRLS
Number courses taken.............................  28,850     29,238
Number courses repeated first time...............   2,110      1,517
Number courses repeated second time..............     337        166
Number courses repeated third time...............      71         19
Per cent failure at end of term..................     8.5        5.0
Average age of graduating class..................    17.7       17.1
Graduating class, average number years required
  [remainder of label illegible].................     4.2        4.0
[Label illegible]................................   105.3      104.8

* Johnson, G. R.: Op. cit.

It may be pointed out that there are other alternatives. There 
is reason to suspect that girls experience something of a generalized 
feeling of inferiority with respect to their sex. Their superior attain- 
ment in school may be the result of compensatory adjustment. More- 
over, as we shall see presently, this feminine superiority is not a general 
one, but exists only in certain subjects. In the second place, there is 
reason to question the practice of regarding all behavior problems in 
school as instances of maladjustment. Group mores may dictate 
some of them. Problem behavior may be little more than a method 
of having some fun. Feminine mores do not permit of as much latitude
in this regard as do those of boys. Uncoöperativeness with the
teacher and a certain amount of nonchalance in doing the assignments 
may, like tripping a fellow-student as he perambulates down the aisle, 
be symptomatic of maladjustment or may be little more than instances 
of accepted masculine behavior. 

In the subsequently cited investigations sex comparisons are 
based on achievement test scores. If the marks earned by boys and 
girls are without sex bias, these investigations are continuous with 
those described in the foregoing paragraphs. However, as some bias 
may be present, it is safer to regard the foregoing and subsequent 
investigations as pertaining to two separate, but related, problems. 

In 1927 Lincoln¹¹ summarized the then existing literature relevant
to “sex differences in school accomplishments.” For present purposes
a brief statement of the principal findings will suffice. In reading, 
at the elementary-school level, girls tended to excel by small margins. 
Statistical treatment of the differences is usually not supplied, but from 
their magnitude it seems unlikely that the majority of them would 
have been found to be significant by the conventional standards. In 
speed and quality of handwriting girls had the advantage, both in the 
elementary school and in the high school. The investigations of 
handwriting are the only ones of this period pertaining to sex differ- 
ences in which the data permit of tests of statistical significance. The 
differences are significant. Girls showed a slight superiority in spelling 
in the lower grades, the advantage increasing somewhat at the higher- 
grade levels. On language usage tests and on composition scales, 
elementary and high school, girls excelled by an ostensibly wide 
margin. The differences were not treated statistically. Boys and
girls were approximately equal in achievement in arithmetic and 
in algebra; the former were superior in geometry. In history the 
advantage went to the boys. 

In the following paragraphs some of the investigations of sex dif- 
ferences published since 1927 are mentioned. 

Reading.—Various reports on sex comparisons in reading subsequent
to the publication of Lincoln’s summary have shown a conspicuous
absence of significant differences. Commins⁵ obtained a
nonsignificant difference of 5.2 raw score points in favor of girls on the
reading tests of the Stanford Achievement battery, in the fifth grade.
Traxler,¹⁶ Moore,¹³ and Jordan¹⁰ have failed to obtain significant sex
differences in reading at the high-school level.

Language and Literature.—Commins⁵ obtained a significant difference
in favor of girls for a small group of fifth-grade pupils on the
language test of the Stanford Achievement battery. Lund¹² found a
marked superiority in favor of college freshman girls on the English
Placement and the Carnegie English tests. In his analysis of the
results of the High School Senior Examination administered to more
than nineteen thousand seniors in the North Carolina high schools,
Jordan¹⁰ found that on the English Usage test only twenty-nine per
cent of the boys equalled or exceeded the mean score for girls. On
the test of knowledge of literature the two sexes scored equally.
Carroll⁴ reports a difference in favor of girls on his Prose Appreciation
Test, about a third of the boys equalling or exceeding the median of
the girls.

High School Mathematics.—Various investigations have yielded
evidence of a masculine superiority in geometry. Lund,¹² dealing with
entering college freshmen, composed of groups of boys and girls
equated in general scholarship, found that boys excelled on the
mathematics section of the Carnegie Foundation tests. A similar
result is reported by Eells and Fox⁶ for the mathematics section of the
Iowa High School Content Examination, the boys and girls being
equated on the basis of units of credit previously earned in high-school
mathematics. On the mathematics subtests (consisting of algebra
and geometry) of the North Carolina High School Senior Examination,
Jordan¹⁰ obtained small differences in favor of boys. Thirty-five per
cent of the girls reached or exceeded the mean for boys. Thirteen and
fifty-seven hundredths per cent of the boys earned letter grades of
A and B; 9.97 per cent of the girls earned like grades. The reader is
referred to additional articles by Foran and O’Hara,⁷ Perry,¹⁴ and
Webb.¹⁷

History.—In the investigations of sex differences in history boys
have shown a consistent superiority. The reader is referred to articles
by Commins,⁵ Caldwell and Mowry,³ and Jordan.¹⁰ In Jordan’s
investigation, of the nineteen thousand high-school seniors represented, 
15.02 per cent of the boys earned grades of A and B, as compared to 
9.93 per cent of the girls. On the other hand, 25.85 per cent of the 
girls earned grades of D and E, as compared to 18.53 per cent of the 
boys. 

Science.—In science the advantage likewise goes to boys, as is
attested by investigations by Commins,⁵ Atkinson,¹ Hurd,⁸ and
Jordan.¹⁰ Comparisons were made in nature study by Commins, in
physical science by Hurd, and in general science by Atkinson and 
Jordan. The latter found that thirty-two per cent of the girls equalled 
or exceeded the mean of the boys. Boys earned twice as many A’s 
and B’s and only half as many D’s and E’s as girls. 


SEX DIFFERENCES ON THE IOWA EVERY-PUPIL HIGH-SCHOOL TESTS 


For several years the College of Education and the Extension 
Division of the State University of Iowa have sponsored a state-wide 
achievement testing program at the high-school level. Comprehensive 
objective tests have been prepared yearly in each formally organized 
academic subject in the high-school curriculum, and administered to 
the pupils of the participating schools. Each of the examinations 
has been one hour in length. The average number of high schools 
participating in this program has exceeded three hundred; and the 
average number of pupils has exceeded fifty thousand. The tests 
have been administered in all schools participating in the program in 
the third week in May each year since the program was inaugurated. 

From time to time samples have been drawn from the files of test 
results and the scores analyzed for various purposes. The present 
article reports the results, with respect to sex differences, of twenty-six 
such projects in twelve different subjects, for data gathered for the 
years 1932-1939. The number of cases included in each sampling is 
shown in Table III. In selecting each sample, the test papers of all 
pupils tested were arranged alphabetically by pupils’ names within 
each school, and every tenth paper was drawn in each school. For
each test Table III gives the number of boys and girls constituting the
samples used, the mean scores for the two sexes, the differences between
the means and the significance ratios (the ratio of the difference to its
standard error). The minus sign is arbitrarily used to signify a
difference in favor of girls.


TABLE III.—SEX DIFFERENCES IN PERFORMANCE ON IOWA
EVERY-PUPIL TESTS OF HIGH-SCHOOL ACHIEVEMENT

                               Number         Mean scores    Differ-  Signif-
Test                                                          ences   icance
                            Boys   Girls     Boys    Girls            ratios

[Illegible]                  231     269    16.08    16.72    -0.64     .86
[Illegible]                  530     468    14.94    16.03    -1.09    1.93
[Illegible]                  263     279    14.46    10.10     4.36    1.63
Plane geometry               498     502    13.42    12.20     1.22    6.32
Plane geometry               470     530    17.99    17.26      .73    1.58
Plane geometry               431     569    16.54    14.96     1.58    3.10
General science              260     240    49.54    39.37    10.17   11.96
General science              229     271    47.45    38.06     9.39    8.24
General science              548     452    47.6     40.7      6.9     8.73
General science              525     475    46.12    41.1      5.02    7.28
Biology                      348     375    61.28    57.74     3.54    3.05
Biology                      546     454    63.32    62.1      1.22    1.37
Physics                      479     521    37.94    31.78     6.16    8.32
[Illegible]             [illeg.]      98    39.76    31.00     8.76    7.49
World history                487     513    46.74    44.55     2.19    2.15
American government          272     328    48.48    43.63     4.85    4.58
American government          449     551    61.03    58.37     2.68    2.61
American history             424     576    60.40    55.34     5.06    5.11
Contemporary affairs.
  Ninth grade                236     264    13.79    11.28     2.51    2.65
  Tenth grade                197     235    20.08    12.84     7.24    6.35
  Eleventh grade             201     275    27.31    19.61     7.70    6.21
  Twelfth grade              232     268    33.50    24.52     8.98    6.70
Economics                   1122    1286    57.80    54.92     2.88    4.97
Reading comprehension.
  Part I                     489     551    78.87    79.71    -0.84     .74
  Part II                    489     551    22.96    23.27    -0.31     .57
Latin.
  [Illegible]                236     353    75.70    68.50     7.2     3.36
  [Illegible]                170     292    98.90    97.80     1.1      .36
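
The significance ratio defined above can be illustrated with a short sketch. The article does not state how the standard error of the difference was computed; the code below assumes the usual formula for two independent samples, the square root of the sum of each variance divided by its sample size. The figures are those of the third Plane Geometry test: the means come from Table III, the standard deviations (7.98 and 7.84) are the ones quoted in the text, and the sample sizes are taken as 431 boys and 569 girls. The function name is illustrative only.

```python
import math


def significance_ratio(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Ratio of a difference in means to its standard error.

    Assumption (not stated in the article): the standard error is
    sqrt(sd_a**2 / n_a + sd_b**2 / n_b), i.e. independent samples.
    """
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# Third Plane Geometry test: boys first, girls second.
ratio = significance_ratio(16.54, 14.96, 7.98, 7.84, 431, 569)
print(f"significance ratio = {ratio:.2f}")
```

Under this assumption the result falls close to the 3.10 printed in Table III; the small discrepancy may reflect rounding or a slightly different standard-error formula in the original computation.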

The differences in scores on the various tests are not, of course, 
directly comparable, since the tests are scored in different units. 
Accordingly, to make feasible a ranking of the differences in order of 
size, each difference in mean score was expressed as a per cent of the 
average standard deviation (for the two sexes) of the obtained scores. 
For example, on the third Test in Plane Geometry, the difference (D)
in the means of the boys and the girls is 1.58. The standard deviation
of the obtained scores is 7.98 for the boys and 7.84 for the
girls; thus the average “within-sex standard deviation” (σws) is
(7.84 + 7.98)/2 = 7.91. Accordingly, the difference in mean scores
is (100 × 1.58/7.91) = 19.97 per cent of the average standard deviation
within sexes; that is, 100D/σws = 19.97.
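
The worked example can be reproduced in a few lines. This is a minimal sketch of the arithmetic described in the text, using the means printed in Table III for the third Plane Geometry test; the function and variable names are illustrative and do not come from the article.

```python
def percent_of_within_sex_sd(mean_boys, mean_girls, sd_boys, sd_girls):
    """Express a difference in means as a per cent of the average
    within-sex standard deviation (the article's 100D / sigma_ws)."""
    difference = mean_boys - mean_girls
    sigma_ws = (sd_boys + sd_girls) / 2  # simple average of the two SDs
    return 100 * difference / sigma_ws


# Third Plane Geometry test: means 16.54 and 14.96, SDs 7.98 and 7.84.
pct = percent_of_within_sex_sd(16.54, 14.96, 7.98, 7.84)
print(round(pct, 2))
```

Because the index divides by a standard deviation, it is unit-free, which is what allows subjects scored on different scales to be ranked together.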

This percentage for any given test is inversely proportional, 
roughly, to the amount of overlapping between the distributions of 
scores for the two sexes, and is fairly comparable from test to test, on 
the assumption that the variation in achievement is essentially the 
same for all subjects. 
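
The link between this percentage and the overlap of the two distributions can be sketched under an added assumption that the article does not make explicit: scores for each sex are normally distributed with the same standard deviation. On that assumption, the share of the lower-scoring sex that reaches or exceeds the mean of the higher-scoring sex follows from the standard normal curve. The function name is hypothetical.

```python
import math


def share_reaching_other_mean(percent_of_sd):
    """Share of the lower-scoring group at or above the other group's mean.

    Assumes (for illustration only) normal score distributions with
    equal standard deviations in the two groups.
    """
    d = percent_of_sd / 100  # difference in standard-deviation units
    normal_cdf = 0.5 * (1 + math.erf(d / math.sqrt(2)))
    return 1 - normal_cdf


for pct_value in (0, 19.97, 50, 71.25):
    share = share_reaching_other_mean(pct_value)
    print(f"100D/sigma_ws = {pct_value:6.2f} -> share = {share:.3f}")
```

A difference of zero gives a share of one half, and the share shrinks as the percentage grows, which is the sense in which the index varies inversely with overlap.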


TABLE IV.—RANK ORDER OF SUBJECTS ACCORDING TO MAGNITUDE
OF SEX DIFFERENCES


RANK    SUBJECT                    100D/σws
  1     General science             71.25
  2     Physics                     70.70
  3     Contemporary affairs        51.17
  4     American history            32.48
  5     American government         27.25
  6     Economics                   20.31
  7     Plane geometry              23.29
  8     Latin                       15.78
  9     Biology                     15.74
 10     World history               13.52
 11     Algebra                     11.38
 12     Reading comprehension        4.14


Table IV shows the rank order of the twelve high-school subjects 
according to the magnitude of the sex differences, as determined by 
the foregoing method. 
TABLE V.—SEX DIFFERENCES IN PERFORMANCE ON THE 1940 IOWA
EVERY-PUPIL TESTS OF BASIC SKILLS*

                               Number         Mean scores     Differ-
Test and grade              Boys   Girls     Boys     Girls    ences

Test A. Silent reading comprehension.
 Part I—Reading comprehension.
  Grade III                  351     362    22.85     24.97    -2.12
  Grade IV                   370     371    31.57     34.32    -2.75
  Grade V                    346     374    38.30     39.59    -1.29
  Grade VI                   432     455    29.60     29.90    - .30
  Grade VII                  533     541    34.64     34.74    - .10
  Grade VIII                 446     424    41.64     42.11    - .47
 Part II—Vocabulary.
  Grade III                  351     362    18.62     20.22    -1.60
  Grade IV                   370     371    25.99     28.20    -2.21
  Grade V                    346     374    31.71     33.41    -1.70
  Grade VI                   432     455    19.92     20.17    - .25
  Grade VII                  533     541    24.21     24.67    - .46
  Grade VIII                 446     424    29.81     31.05    -1.24
Test B. Work study skills.
  Grade III                  334     343    28.28     29.09    - .81
  Grade IV                   371     366    40.29     42.79    -2.50
  Grade V                    350     366    50.60     53.49    -2.89
  Grade VI                   432     455    50.27     53.27    -3.00
  Grade VII                  525     542    58.59     60.92    -2.33
  Grade VIII                 449     422    68.33     70.71    -2.38
Test C. Basic language skills.
  Grade III                  334     338    99.16    109.45   -10.29
  Grade IV                   358     368   123.54    134.94   -11.40
  Grade V                    335     363   141.67    152.70   -11.03
  Grade VI                   429     451   183.93    196.91   -12.98
  Grade VII                  531     541   198.31    215.57   -17.26
  Grade VIII                 440     429   216.39    232.44   -16.05
Test D. Basic arithmetic skills.
  Grade III                  334     348    25.87     25.21      .66
  Grade IV                   370     374    42.94     43.78    - .84
  Grade V                    345     378    58.14     57.92      .22
  Grade VI                   425     457    38.29     37.65      .64
  Grade VII                  535     546    50.41     50.18      .23
  Grade VIII                 460     433    63.73     61.62     2.11

Significance ratios: column illegible.

* Calculations made by R. K. Woods.
SEX DIFFERENCES IN SCORES ON THE IOWA EVERY-PUPIL TESTS OF
BASIC SKILLS


In 1935 the Iowa Every-Pupil Testing Program was extended 
downward to include Grades VI, VII and VIII, and in 1940 to include 
Grades III, IV, and V. The tests used in this program are known as
the Iowa Every-Pupil Tests of Basic Skills. There are two batteries,
Elementary and Advanced, with four tests in each battery. The
average administration time of the Advanced Battery is sixty-seven 
minutes per test; of the Elementary Battery, forty-eight minutes per 
test. Test B, Work Study Skills, tests skill in reading maps, charts, 
and graphs, and in using the dictionary, indexes, and general refer- 
ences. Test A, Silent Reading Comprehension: Part I, Reading 
Comprehension and Part II, Vocabulary; Test C, Basic Language 
Skills; and Test D, Basic Arithmetic Skills, require no description. 

The samples used in the results here reported were drawn by select- 
ing the papers of every tenth pupil from alphabetical lists, separately 
prepared for each grade within each school. Thus by multiplying the 
N’s in Table V by 10 the approximate number of pupils tested per 
grade may be ascertained. The Elementary Battery was administered 
to Grades III-V; the Advanced Battery, to Grades VI-VIII. The 
sex comparisons for the 1940 testing program are shown in Table V. 
Again the minus sign is arbitrarily used to signify a difference in favor 
of girls. 
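
The sampling rule described above, every tenth paper from an alphabetical list prepared separately for each grade within each school, can be sketched as follows. The roster is hypothetical, and the article does not say which paper began the count; the sketch assumes the tenth, twentieth, and so on were drawn.

```python
def every_tenth_paper(names, step=10):
    """Alphabetize one grade's papers within one school and draw every
    tenth one. Starting at the tenth paper is an assumption."""
    ordered = sorted(names)
    return ordered[step - 1::step]


# Hypothetical roster of 95 pupils in one grade of one school.
roster = [f"Pupil{i:03d}" for i in range(1, 96)]
sample = every_tenth_paper(roster)

# Multiplying the sample size by 10 approximates the number tested.
estimated_tested = len(sample) * 10
print(len(sample), estimated_tested)
```

The estimate recovers the number tested only approximately, since up to nine papers at the end of each list fall outside the sample; this is consistent with the text's wording, "approximate number of pupils tested."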

Table VI gives the rank order of differences in mean scores, each 
expressed as a percentage of the corresponding within-sex standard 
deviation. These values are averages for all grades in each subject. 


TABLE VI.—RANK ORDER OF TESTS OF BASIC SKILLS ACCORDING TO
MAGNITUDE OF SEX DIFFERENCES*


RANK    TEST                       100D/σws
  1     Language                    51.58
  2     Work study                  18.18
  3     Reading-vocabulary          15.52
  4     Reading-comprehension       10.67
  5     Arithmetic                   5.85


* Calculations made by R. K. Woods.


SUMMARY 


In the Iowa Every-Pupil Basic Skills Testing Program (for Grades 
III-VIII) girls have maintained a consistent and, on the whole, significant
superiority over boys in the subjects tested, save in arithmetic,
where small, insignificant differences favor boys. These findings 
corroborate previous investigations in both these respects. On the 
other hand, in the Iowa Every-Pupil High School Testing Program the 
advantages just as definitely have gone to the boys, two exceptions 
being in algebra and reading comprehension, where small and on the 
whole not significant differences favor the girls. 

One explanation of this change in sex superiority from the elementary
school to the high school is that it comes about as a result
of a change in the curriculum. With the exception of language usage
and allied subjects, the subjects in which girls show their greatest 
superiority are not found among formally organized subjects of high- 
school curricula. On the other hand, those subjects in which boys 
appear to excel in the elementary school, the sciences and the social 
sciences, loom relatively large in high school curricula. Unfortunately 
the data at hand are too meager to permit any very positive statement 
about male superiority in the sciences and social sciences at the ele- 
mentary-school level. In the two subjects that run through both the 
elementary and high school with respect to which we do have ample 
data—reading and language usage—we find the two sexes maintaining 
the same relative positions throughout. 


BIBLIOGRAPHY 


(1) Atkinson, C.: “The effect of sex differences in the study of general
science.” Journal of educational research, Vol. xxiv, 1931, pp. 61-66.

(2) Ayres, L. P.: Laggards in our schools. Ch. 14. New York: Russell Sage
Foundation, 1909.

(3) Caldwell, F. F. and Mowry, M. D.: “Sex differences in school achievement
among Spanish-American and Anglo-American children.”
Journal of educational sociology, Vol. viii, 1934, pp. 168-173.

(4) Carroll, H. A.: “Influence of the sex factor upon appreciation of
literature.” School and society, Vol. xxxvii, 1933, pp. 468-472.

(5) Commins, W. D.: “More about sex differences.” School and society,
Vol. xxviii, 1928, pp. 599-600.

(6) Eells, W. C. and Fox, C. S.: “Sex differences in mathematical achievement
of junior college students.” Journal of educational psychology,
Vol. xxiii, 1932, pp. 381-386.

(7) Foran, T. G. and O’Hara, C. S.: “Sex differences in achievement in
high-school geometry.” School review, Vol. xliii, 1935, pp. 357-362.

(8) Hurd, A. W.: “Sex differences in achievement in physical science.”
Journal of educational psychology, Vol. xxv, 1934, p. 70.










(9) Johnson, G. R.: “Girls lead in progress through school.” American
school board journal, Vol. xcv, 1937, pp. 25-26.

(10) Jordan, A. M.: “Sex differences in mental traits.” High school journal,
Vol. xx, 1937, pp. 254-261.

(11) Lincoln, E. A.: Sex differences in the growth of American school children.
Ch. 4. Baltimore: Warwick and York, 1927.

(12) Lund, F. H.: “Sex differences in type of educational mastery.” Journal
of educational psychology, Vol. xxiii, 1932, pp. 321-330.

(13) Moore, J. E.: “A further study of sex differences in speed of reading.”
Peabody journal of education, Vol. xvii, 1940, pp. 359-362.

(14) Perry, W. M.: “Are boys excelling girls in geometric learning?” Journal
of educational psychology, Vol. xx, 1929, pp. 270-279.

(15) St. John, C. W.: “The maladjustment of boys in certain elementary
grades.” Educational administration and supervision, Vol. xviii, 1932,
pp. 659-672.

(16) Traxler, A. E.: “Sex differences in rate of reading in the high school.”
Journal of applied psychology, Vol. xix, 1935, pp. 351-352.

(17) Webb, P. E.: “A study of geometric abilities among boys and girls of
equal mental abilities.” Journal of educational research, Vol. xv, 1927,
pp. 256-262.













A TEST OF PREFERENCES FOR TRADITIONAL 
AND MODERN PAINTINGS 


ELIAS KATZ 
New York City 


The aim of the test discussed here was to measure children’s 
preferences for traditional paintings, as compared with their prefer- 
ences for modern paintings. No attempt was made to evaluate the 
aesthetic merit of either traditional or modern paintings. 

The traditional paintings for the test were selected from the
Picture Study list in the Elementary Art Syllabus and Course of Study¹
(hereafter referred to in the text as the Syllabus) of a large public
school system. The paintings recommended in this list were typical
of those generally included in picture study lists over the country.
Evidence of this was the fact that almost three-quarters of the
paintings most often recommended for picture study in fifty-six widely
used art courses of study² appeared in this Syllabus. In addition, the
Syllabus, when adopted in 1931, reflected current standards as to
suitable paintings for picture study in the elementary schools of a
large public school system. At the time the test was constructed
(1940-1941), the Syllabus was in the process of revision in the light of
changing standards of taste and greater understanding of children’s
preferences.

Altogether there were one hundred thirteen paintings recom- 
mended for Grades 1A through 8B in the Syllabus. Of these, thirty 
paintings were recommended for the seventh and eighth grades, and 
were not used, since the investigation was limited to grades below 
the seventh. Of the remaining eighty-three paintings, nineteen were 
not used for various reasons. In eight instances, modern paintings 
to be compared with traditional paintings could not be found to meet 
with the approval of a majority of the judges. Four traditional paint- 
ings were not available in color. In four instances, the recommended 
painting was actually a modern painting as defined in this study and, 
therefore, could not be used. Three paintings by Winslow Homer 
were not included, because of difficulty in classifying him as traditional
or modern. In all, sixty-four traditional paintings were used for the
test.

1 New York City, Elementary Art Syllabus and Course of Study, 1931, pp. 11,
12, 27, 43, 44, 57, 58.
2 Morrison, Jeannette G.: Children’s Preferences for Pictures Commonly Used
In Art Appreciation Classes, 1935, p. 15.

The modern paintings to be matched with the traditional paintings
for the test were selected from the vast range of paintings by modern
artists. The paintings of modern European, Mexican, and American 
artists, from the school of Impressionism through contemporary social 
painters, were well represented. 

The omission of paintings by many great masters of modern paint- 
ing was in some instances due to the paucity of available color repro- 
ductions of their work. In other cases, subject-matter which was 
characteristic of some modern artists, like abstractionists and surreal- 
ists, could not be matched with any of the traditional paintings. 

After the paintings were matched in subject-matter, they were submitted
to eleven competent judges for approval.¹ To show the paintings
to the judges, 9 × 11 inch manila folders were used. A traditional
painting and a modern painting similar in subject-matter were
placed in each folder. Each judge was given the following instruc- 
tions, together with a sheet on which to record approval, disapproval, 
or alternate suggestions: 


This is a study of children’s preferences for traditional paintings as com- 
pared with modern paintings. 

These pairs of paintings are to be shown to children in Grades II through 
VI by means of lantern slides. 

Paintings by modern artists have been selected and paired with traditional 
paintings. The basic criterion for pairing was similarity in subject-matter. 

Please indicate on the accompanying blank sheet whether you approve or 
disapprove of the choice of modern painting to be matched with the traditional 
painting. If you disapprove, please try to suggest another modern painting, 
similar in subject-matter, which is available in a color reproduction. 


It was arbitrarily assumed that a pair of paintings should be 
included in the test if it was approved by a majority of the judges. 
Expressed quantitatively, a pair of paintings was included if there was 
agreement of 55 per cent or more among the eleven judges. 
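
The 55 per cent figure follows from the size of the panel: a bare majority of eleven judges is six, and six of eleven is 54.5 per cent. A small sketch of that arithmetic, with illustrative names that are not from the article:

```python
JUDGES = 11


def approval_percent(approving, judges=JUDGES):
    """Per cent of the panel approving a pair of paintings."""
    return 100 * approving / judges


# Smallest number of judges that constitutes a majority of eleven.
minimum_majority = JUDGES // 2 + 1

print(minimum_majority, round(approval_percent(minimum_majority), 1))
print(8, round(approval_percent(8), 1))  # approval by eight judges
```

Approval by eight judges, the level mentioned in the next paragraph, corresponds to roughly 73 per cent of the panel.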

1 Including eight members of the faculty of the Department of Fine Arts,
Teachers College, Columbia University, two art psychologists on the faculty of
Teachers College, Columbia University, and the Assistant Director of Fine Arts,
New York City Public Schools.

It was found that about six-sevenths of the pairs of paintings
included in the test were approved by eight or more judges. The
average percentage of approval of all pairs of paintings by all judges
was 84.2 per cent. The minimum percentage of fifty-five per cent
and the 84.2 per cent average percentage of approval among eleven
judges compared favorably with the criterion of expert agreement
used by Meier (sixty per cent),¹ by McAdory (sixty-four per cent),²
and by Faulkner (sixty-five per cent)³ in their well-known tests of art
judgment.

Color reproductions of all the approved pairs of paintings were then
photographed in Kodachrome, and mounted in pairs in 2″ × 2″
lantern slides. Following are some advantages of using slides for
showing the paintings: 

(1) Pairs of paintings can be shown simultaneously, and can be 
changed with a minimum loss of time. 

(2) Discrepancies in the actual size of pairs of paintings are reduced 
to a minimum. 

(3) Pairs of paintings can be shown to a large group of children
at the same time, thus making possible group testing.

(4) The influence of the teacher, the experimenter, and any distracting
influences can be reduced to a minimum.

On the basis of preliminary experiments, a standardized procedure
was adopted. A classroom with dark shades was made available to
the experimenter. A 56″ × 72″ portable glass-beaded screen, its
lower edge 36″ from the floor, was placed in the center at the front of
the room. A 300-watt 2″ × 2″ lantern slide projector, with an automatic
cooling attachment, was placed in the center of the room about
fifteen feet from the screen. The projected image was about 36″ high.
The shades were drawn in such a way as to allow some light to come
into the room so that the subjects could see their answer sheets. However,
no light was permitted to strike the screen directly. The answer
sheets were standardized forms, and the directions for administering
the test were standardized.⁴

1 Meier, Norman C.: “A Measure of Art Talent,” in Psychological Monographs,
Vol. xxxix, No. 2, 1928, p. 196.
2 McAdory, Margaret: The Construction and Validation of an Art Test, 1929,
p. 10.
3 Faulkner, Ray: An Experimental Investigation Designed to Develop Tests to
Measure Understanding and Appreciation, 1937, p. 78.
4 The lantern slides are available for use in similar experimentation, and may
be secured at nominal cost from the author.

The validity of the test was demonstrated in two ways. In the
first place, the test may be considered valid in terms of the high
agreement (84.2 per cent) among experts as to the validity of individual
items in the test. A second indication was the extent to which the
test differentiated among 2437 children when grouped by grade, sex,
and school. It will be noted in Table I that mean preference scores
on the test tended to increase from the second through the sixth grade;
that girls tended to have higher mean preference scores than boys;
and that there were variations among mean preference scores from
school to school.¹


TABLE II.—ANALYSIS OF VARIANCE OF MEAN PREFERENCE SCORES
OF 2437 CHILDREN TO WHOM THE TEST OF PREFERENCES FOR
TRADITIONAL AND MODERN PAINTINGS WAS ADMINISTERED
Children Grouped According to Grade, Sex, and School


SOURCE OF        SUM OF       DEGREES OF      MEAN
VARIATION        SQUARES       FREEDOM       SQUARE         F       F.01
Total          1239.5880         49
Grades          692.3783          4         173.0946     34.78*     3.83
Sex              61.9831          1          61.9831     12.45*     7.31
Schools         286.1615          4          71.5404     14.38*     3.83
Remainder       199.0651         40           4.9766


NOTE: An important limitation of the analysis of variance of the mean
preference scores in Table I was the fact that the number of individuals in
each cell (that is, the number of children in parentheses after each mean
preference score) varied from cell to cell. Under ideal conditions, the number
of individuals in each cell should be equal, or the number of individuals in
each cell should be proportional to the group of which it is a part. The effect
of slight deviations from equal numbers of individuals, or proportional numbers
of individuals in each cell, may be compensated for to some extent by laborious
calculations, but will not be eliminated entirely. For the preliminary form of
the test, it was assumed that the mean preference scores in each cell were
based on proportional numbers of individuals.

* Significant at the .01 level.

By means of the analysis of variance technique,2 it is demonstrated
in Table II that it was extremely unlikely (less than one chance in 
one hundred) that the observed variation of mean preference scores 
among grades, between boys and girls, and among schools, could be 
attributed to chance. It may, therefore, be concluded that on the 
whole the test was valid in the sense that it did differentiate among the 
children, when grouped by grade, sex, and school. 
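
The F ratios reported in Table II can be rechecked from the sums of squares and degrees of freedom printed there. The following is a minimal sketch in Python; the function and variable names are illustrative assumptions, and the critical values are simply copied from the last column of the table rather than computed.

```python
# Sums of squares and degrees of freedom as printed in Table II.
effects = {
    "Grades": (692.3783, 4),
    "Sex": (61.9831, 1),
    "Schools": (286.1615, 4),
}
remainder_ss, remainder_df = 199.0651, 40

# Tabled F values at the .01 level, copied from Table II (not computed here).
critical_f_01 = {"Grades": 3.83, "Sex": 7.31, "Schools": 3.83}


def f_ratios(effects, error_ss, error_df):
    """Return {effect: F}, with F = effect mean square / remainder mean square."""
    error_ms = error_ss / error_df
    return {name: (ss / df) / error_ms for name, (ss, df) in effects.items()}


ratios = f_ratios(effects, remainder_ss, remainder_df)
for name, f_value in ratios.items():
    verdict = "exceeds" if f_value > critical_f_01[name] else "does not exceed"
    print(f"{name}: F = {f_value:.2f} ({verdict} {critical_f_01[name]})")
```

Each effect is tested against the remainder mean square, which is how the three F values in the table appear to have been formed.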





1 The higher the score, the greater was the preference for traditional paintings; 
the lower the score, the greater was the preference for modern paintings. 

2 Snedecor, George W.: Calculation and Interpretation of Analysis of Variance 
and Covariance, 1934, pp. 1-96. 
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The validity of individual items in the test was quantitatively esti- 
mated as the degree to which success or failure on the individual item 
was an indication of success or failure on the whole test. “Item” in 
this sense refers to a pair of traditional and modern paintings, of which 
there were sixty-four in the test. Preference for the traditional paint- 
ing was arbitrarily called ‘‘success,’’ and given a value of 1. The 
correlation between success (or failure) on the item, and total score 
on the test, was computed by the bi-serial r technique. Each item in 
the test yielded its own particular bi-serial r, which could be con- 
sidered a quantitative estimate of that item’s validity. 
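
The item-validity computation described above can be sketched as follows. This is a minimal illustration, assuming the common form of the bi-serial coefficient, r = ((M_p - M_T) / sigma_T)(p / z), where M_p is the mean total score of those who "succeed" on the item, M_T and sigma_T are the mean and standard deviation of all total scores, p is the proportion succeeding, and z is the ordinate of the unit normal curve at the point dividing p from 1 - p. The function name and the small data set are hypothetical and are not taken from the study.

```python
from statistics import NormalDist, mean, pstdev


def biserial_r(item, total):
    """Bi-serial r between a 0/1 item and the total test score.

    item  : list of 0/1 values (1 = preference for the traditional painting)
    total : list of total scores, one per subject, in the same order
    """
    p = mean(item)  # proportion of subjects scoring 1 on the item
    if p in (0, 1):
        raise ValueError("item must have both successes and failures")
    m_p = mean(t for i, t in zip(item, total) if i == 1)
    m_t = mean(total)
    sd_t = pstdev(total)  # standard deviation of all total scores
    unit_normal = NormalDist()
    z = unit_normal.pdf(unit_normal.inv_cdf(p))  # ordinate at the dividing point
    return (m_p - m_t) / sd_t * p / z


# Hypothetical scores for eight subjects, for illustration only.
item_scores = [1, 1, 1, 0, 0, 1, 0, 1]
total_scores = [50, 44, 39, 20, 25, 41, 30, 47]
print(f"bi-serial r = {biserial_r(item_scores, total_scores):.3f}")
```

An item on which the "successful" subjects have a higher mean total score than the group as a whole yields a positive coefficient; the reverse case yields a negative one, as with Item 30.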

Table III has listed the sixty-four items in the test, in the order of 
their item validity, as computed for 2437 subjects. The validity 
coefficients ranged from .690 for Items 22 and 52 to —.115 for Item 30. 

The significance of item validity will be more apparent through 
examination of actual pairs of paintings with high- and low-validity 
coefficients. The five items with the highest validity coefficients were: 


22. Van de Velde—Entrance to a ... Marin—Seascape
52. Turner—Fighting Temeraire.... Schmidt-Rotluff—Harbor 
15. Troyon—Cattle............... Cook—White-faced Cattle 
48. Rubens—The Painter’s Sons.... Picasso—Pierrot and Harlequin 
43. Greuze—The Dead Bird........ Laurencin—Girl 


Because of their high validity coefficients, these pairs of paintings 
were the most satisfactory measures of children’s preferences for 
traditional and modern paintings. 

On the other hand, the five items with the lowest validity coeffi- 
cients were: 


64. Stuart—Washington................ Cezanne—Self-Portrait
54. Hals—Nurse and Child............. Renoir—Gabriel et Jean
9. Shannon—Fairy Tales.............. Renoir—Three Daughters
6. Hitchcock—In the Tulip Fields...... Monet—Gladioli
30. Sargent—Carnation, Lily, Lily, Rose.. Renoir—In the Meadow


Because of their low validity coefficients, these five pairs of paint- 
ings were the least satisfactory measures of children’s preferences for 
traditional and modern paintings. This may be due to various 
reasons. In the Hals-Renoir, and the Shannon-Renoir pairs of 
paintings (Items 54 and 9), the lack of difference between the modern 
painting and the traditional painting may have been the cause. In 
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TABLE III.—PAIRS OF PAINTINGS IN THE TEST OF PREFERENCES FOR
TRADITIONAL AND MODERN PAINTINGS ARRANGED IN ORDER OF
SIZE OF THE ITEM VALIDITY COEFFICIENT
Based on 2437 Cases

Item(a)  Validity         Item  Validity       Item  Validity       Item  Validity
         coefficient(b)         coefficient          coefficient          coefficient
  22     .690              32   .580            58   .455            59   .335
  52     .690              34   .580(c)         13   .430(c)         19   .330
  15     .670              36   .570(c)         42   .425(c)         18   .320
  48     .660(c)           60   .570            63   .425            46   .305(d)
  43     .655(c)           20   .560            38   .420(c)         14   .280(d)
  23     .650              62   .545             1   .410            26   .270
  27     .650              25   .540             7   .410            10   .240
  57     .650(c)           35   .535            44   .410            11   .240(d)
  47     .640              49   .535            50   .410            12   .230
  39     .610              53   .520             3   .390            21   .210
  55     .610(c)            5   .510            31   .390            51   .210(c)
  37     .600              16   .490            33   .385            64   .180(d)
  28     .590              45   .485            17   .380(d)         54   .150
  29     .590               2   .485             8   .360             9   .090
  40     .590              41   .480            61   .350             6   .040
  56     .585              24   .460             4   .340            30  -.115(c)

(a) Item number was the order in which the pair of paintings was shown in the
test.
(b) Computed by the formula:
$$r_{bis} = \frac{M_p - M_T}{\sigma_T} \cdot \frac{p}{z}$$
See Dunlap, Jack W.: “Nomograph for Computing Bi-Serial Correlations,”
Psychometrika, Vol. I, 1936, pp. 59-60.
(c) Variation among preferences for the traditional painting and for the modern
painting in the pair of paintings could be attributed to the three factors of
variation among grades, variation between boys and girls, and variation among
schools.
(d) Variation among preferences for the traditional painting and for the modern
painting in the pair of paintings could not be attributed to the three factors of
variation among grades, variation between boys and girls, and variation among
schools.
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the Stuart-Cezanne pair (Item 64), the patriotic association with 
Washington’s portrait by Stuart probably operated to make this
pair of paintings an unsatisfactory measure.

The negative item validity coefficient of the Sargent-Renoir pair 
(Item 30) indicated that preference for the modern painting in the pair 
was usually associated with total scores indicating greater preference 
for traditional paintings. The reason for this unusual discrepancy is 
not apparent. 

Another indication of the validity of individual items in the test 
was the extent to which each item differentiated among the 2437 
subjects when grouped by grade, sex, and school. Just as there were 
variations from grade to grade, between boys and girls, and from 
school to school, in mean preference scores, there were similar vari- 
ations in individual items. The analysis of variance technique was 
applied to each of the sixty-four pairs of paintings. It was found that 
eleven items differentiated among the children with respect to grade, 
sex and school, and from this point of view may be considered most 
valid. On the other hand, there were five items which did not
differentiate at all, and these may be considered least valid. It is
interesting to note in Table III, where these sixteen items have been
identified, that with two exceptions (Items 51 and 30), items with
high validity coefficients were characterized by this power to
discriminate with respect to grade, sex, and school, while items with low
coefficients did not possess this power to differentiate. Under ideal
conditions, items possessing highest validity would have the highest 
validity coefficients, and would differentiate most clearly in terms of 
the factors being tested. 

Besides measuring accurately and truthfully what it set out to 
measure, the test also measured consistently. In Table IV, reliability 
coefficients for the test by odd-even correlation, and by retesting have 
been reported. These reliability coefficients, averaging around .75 and
.80, indicated that the test was reliable enough for group testing.

On the basis of the available reliability coefficients for the test, it 
was possible to estimate the reliability of the test if more items of the 
same nature were to be included. The estimated reliability coeffi- 
cients through the inclusion of one and one-half the number of items 
(bringing the test up to ninety-six items), and of three times the number 
of items, (one hundred ninety-two items), are reported in Table V. 
Increasing the test to ninety-six items would not significantly increase 
the reliability coefficient as estimated for sixty-four items. This 
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would indicate that the present reliability of the test could not be 
greatly raised through the inclusion of thirty-two more items. On the 
other hand, tripling the number of items might raise the reliability of
the test to the level where the test might be useful for individual
testing, if desired.
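
The estimates for a lengthened test follow from the Spearman-Brown prophecy formula, r_k = k r / (1 + (k - 1) r), where r is the observed reliability and k is the factor by which the test is lengthened. A minimal sketch in Python, using the two total-group coefficients from Table IV (.723 by retesting, .759 by odd-even correlation); the function name is an illustrative assumption:

```python
def spearman_brown(r, k):
    """Estimated reliability when a test of reliability r is lengthened k times."""
    return k * r / (1 + (k - 1) * r)


# Observed coefficients for the sixty-four-item test (Table IV, total group).
observed = {"retesting": 0.723, "odd-even": 0.759}

for label, r in observed.items():
    for k in (1.5, 3):  # ninety-six items and one hundred ninety-two items
        print(f"{label}, {k} times the length: {spearman_brown(r, k):.3f}")
```

The gain from adding thirty-two items is small because the formula flattens as r rises; a threefold lengthening is needed to bring the estimate near .90.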


TABLE IV.—RETESTING AND ODD-EVEN RELIABILITY COEFFICIENTS
FOR THE TEST OF PREFERENCES FOR TRADITIONAL AND MODERN
PAINTINGS
Administered to Boys and Girls, Sixth Grade, School C, November 1940

                                       Retesting         Odd-even
                                       coefficient       coefficient
                                        N      r          N      r
Sixth-grade Boys, School C........      95    .723       105    .813(a)
Sixth-grade Girls, School C.......     104    .736       111    .730(a)
Total.............................     199    .723       216    .759(a)

(a) Corrected by Spearman-Brown Formula. See Louis L. Thurstone, The
Reliability and Validity of Tests, p. 37.


TABLE V.—ESTIMATED RELIABILITY COEFFICIENTS FOR ONE AND
ONE-HALF TIMES, AND THREE TIMES THE PRESENT LENGTH OF
THE TEST OF PREFERENCES FOR TRADITIONAL AND MODERN
PAINTINGS

                                 BASED ON RETESTING   BASED ON ODD-EVEN
                                 COEFFICIENT(a)       COEFFICIENT(a)
One and one-half times
  (ninety-six items)........          .797(b)              .825(b)
Three times (one hundred
  ninety-two items).........          .887(b)              .904(b)

(a) See Table IV.
(b) Corrected by Spearman-Brown Prophecy Formula. See Louis L. Thurstone,
The Reliability and Validity of Tests, p. 36.


Knowing the reliability of the test at one grade level in one school 
(Sixth Grade in School C), it has been possible to estimate what the 
reliability of this test should be in order to be effective (a) among all
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the children in School C and (b) among all the grades in the five schools 
in this study. These estimated reliability coefficients are reported in 
Table VI. It is interesting to note that the reliability coefficients 
observed for the one group were not significantly different from the 
estimated reliability coefficients which the test should have in order to 
be effective in all grades in all schools. This indicated that the test
in its present form was reliable enough for use with all the groups to
whom it was administered.


TABLE VI.—ESTIMATED RELIABILITY COEFFICIENTS WHICH THE TEST
OF PREFERENCES FOR TRADITIONAL AND MODERN PAINTINGS
SHOULD HAVE, IN ORDER TO BE EFFECTIVE IN SCHOOL C,
AND IN ALL FIVE SCHOOLS

                                 BASED ON RETESTING   BASED ON ODD-EVEN
                                 COEFFICIENT(a)       COEFFICIENT(a)
For all classes in School C....       .766(b)              .863(b)
For all classes in all five
  schools (schools A, B, ...)..       .796(b)              .904(b)

(a) See Table IV.
(b) Estimated by the formula:
$$\frac{\sigma}{\Sigma} = \frac{\sqrt{1 - R}}{\sqrt{1 - r}}$$
See Kelley, Truman L.: “The Reliability of Test Scores,” Journal of Educational
Research, Vol. III, 1921, pp. 370-379.
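
The estimate for a group of different variability can be sketched in code. This is an illustration only, assuming Kelley's relation in the form sigma / Sigma = sqrt(1 - R) / sqrt(1 - r), where r and sigma are the reliability and standard deviation in the group actually tested and R and Sigma are those in the wider group. Solving for R gives R = 1 - (sigma / Sigma)^2 (1 - r). The standard deviations below are hypothetical, since none are reported in this section, and the function name is an assumption.

```python
def reliability_in_new_range(r, sd_tested, sd_new):
    """Reliability expected in a group with standard deviation sd_new,
    given reliability r observed in a group with standard deviation sd_tested."""
    return 1 - (sd_tested / sd_new) ** 2 * (1 - r)


# Observed retest reliability (Table IV) with hypothetical standard deviations.
r_sixth_grade = 0.723
sd_sixth_grade = 8.0   # hypothetical value, not from the study
sd_all_grades = 10.0   # hypothetical value, not from the study

estimate = reliability_in_new_range(r_sixth_grade, sd_sixth_grade, sd_all_grades)
print(f"estimated reliability in the wider group: {estimate:.3f}")
```

Under this relation a more heterogeneous group yields a higher coefficient for the same test, and equal standard deviations leave the coefficient unchanged.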


As a valid and reliable measuring instrument, the test in its
preliminary form provided quantitative estimates of the nature of and
changes in elementary-school children’s preferences for traditional and
modern paintings. With appropriate revisions, the test’s validity,
reliability, and range of usefulness could be extended to a point where
it might be helpful in studying preferences for other forms of art, at
different age levels, in different types of schools, etc.





1 Kelley, Truman L.: “The Reliability of Test Scores,” Journal of Educational
Research, Vol. III, 1921, pp. 370-379.
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REMEDIAL READING PROGRAMS: 
EVIDENCE OF THEIR DEVELOPMENT 


FRANCES ORALIND TRIGGS 


University of Minnesota Testing Bureau 


The University of Minnesota Press has recently made a survey of 
the status of remedial reading programs in institutions of higher 
education. Questionnaires were sent to 1528 deans of liberal arts 
colleges, presidents or deans of small colleges, teachers colleges, and 
normal schools, of whom three hundred supplied the requested infor- 
mation. Of those replying, one hundred eighty-five now have remedial 
reading programs and at least seventy-three more will offer such a 
service for the first time next year. The survey yielded so much 
valuable information that it seems worth while to report the results in 
the literature for the guidance of those who are considering setting 
up these services. 

Educational literature has been charged with being profuse,
redundant, repetitious, and thin in content. However, the
literature does serve a major purpose in the development of special 
services at the college level. Reviewing the history of any of the 
student services that have developed recently, we can see in the 
literature the pattern of their evolution. In the first articles we find 
an expression of a felt need. Then some one gets an idea as to how 
that need may be met and usually describes it in the literature, either 
before or after trying it out. Sources of materials may or may not be 
discussed at this point. Then a little later appear the surveys to 
discover what other institutions are doing to meet the need, i.e., to
discover who is interested in it, what departments are fostering it, 
what materials are being used, and so on. Next come evaluations of 
the different methods of meeting the need. These evaluations are 
both good and poor, rough and fine. After a few years we may find 
articles concerning the coördination of the new service with the
existing program. It has had its day in the literature and now takes 
its proper place in the educational program; the literature has already 
turned to new services. The stages described overlap, of course, but 
the prevailing pattern is usually clear. 

Recent graduate students can recall that the literature has intro- 
duced them in just this manner to such services as freshman week, 
orientation courses, entrance testing, and coördination of personnel
services. For them the periodical literature takes the place of the 
history book which has not yet been written. Through the literature 
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the person currently working in the field meets and keeps in touch with 
services at other institutions and thus is better able to serve his own 
institution. 

To determine the maturity of any educational service, then, one 
may turn to the literature and see what stage of the pattern has been 
reached. Do current articles concern themselves predominantly 
with descriptions, surveys, or evaluations? Or are they largely still 
in the theoretical stage? 

Following this procedure makes it clear that remedial reading 
services at the college level are not yet past their adolescence, for the 
articles now current are hardly out of the theoretical and survey stages. 
However, the literature that is available can be a real help in showing 
the way to those who are interested in bringing remedial reading 
services to maturity. 

Faced with the problem of deciding whether or not to publish a 
textbook and a manual of exercises in remedial reading, the University of 
Minnesota Press, like all publishers, wanted to know the potential mar- 
ket for such books. It needed information on the following points: 


1. What is the status of remedial work?
   (a) To what extent are colleges doing work in remedial reading?
   (b) How many students are served by existing remedial reading services?
       Has a saturation point been reached, or how many more are likely to
       be served in the near future?
   (c) Are available commercial materials being used in developing better
       reading habits? Are they considered adequate to meet the need?

2. Do those who are aware of the need for remedial work feel that a new
   textbook and manual are needed?

3. To what group should new materials be addressed?
   (a) Who is actually doing the remedial work with students? (Instructors
       in the content fields, personnel officers, specialists?)
   (b) Assuming that there are some who would like to do such work who are
       not now trained for it, who are they? (What content fields do they
       represent or in what capacity are they now serving?)
   (c) From such information can the most suitable form for new materials
       be determined?

4. Where and how is remedial reading work now being done? That is, in
   what situation is the material likely to be used? (As a separate course?
   As a part of how-to-study, orientation, education, or psychology courses?)
   Is the work done mainly in group situations or is it largely individual in
   nature?


To get answers to these questions, the Press turned first to the 
recent surveys in print, one reported by Charters in the Journal of 
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Higher Education1 and one by Witty in School and Society.2 From
each of these surveys certain pertinent information was gleaned. 
However, many of the foregoing questions were still unanswered, and 
the Press decided to make a further survey of its own. In this report 
the results of the Press survey will be compared wherever possible 
to the results obtained from the other two surveys. 


WHAT IS THE STATUS OF REMEDIAL WORK? 


Each of the three surveys at least partially answers the question, 
“To what extent are colleges doing work in remedial reading?” 
Table I summarizes the data. In the Charters and the Minnesota 
Press surveys (omitting the results in the Witty survey because the 
data as given are not comparable) the percentages of answering schools 
that have programs are almost identical. 

How many students are served by existing remedial reading serv- 
ices? The Press survey showed that the one hundred eighty-five 
institutions are serving 7636 students annually, an average of forty-one 
students to a program. How long each program has been in operation 
was also determined. To the extent that the data so obtained overlap 
the results of the Charters survey, the two are again in agreement. 
The Press found that twenty-five institutions started their work this 
last year; seventy-two have been in operation for two or three years; 
and thirty-seven, for four or five years; leaving only twenty-one insti- 
tutions (of those who answered this question) which have programs 
over five years old. The Charters survey says on this point: “The
activity [remedial reading service] is new for most institutions.
Without asking the date of inception we find that large numbers are
organizing the service this year for the first time; in a large number of
cases the work has been running for only two or three years; in a
substantial number a longer period.” From Witty’s conclusion that in
general the poor readers are located but instruction is not available to 
them all, it would seem probable that a saturation point in these 
services has not yet been reached. 

In the Press survey those institutions that do not have remedial 
reading services were asked to indicate when they hoped to start one. 
Seventy-three said they hoped to start a reading program in the Fall of 
1942. If we multiply this number by forty-one (the average number of 
students being served by a program at the present time) we find that





1 Charters, W. W.: “Remedial Reading in College.” The Journal of Higher
Education, Vol. XII, No. 3, March 3, 1941, pp. 117-121.

2 Witty, Paul A.: “Practices in Corrective Reading in Colleges and
Universities.” School and Society, Vol. LII, No. 1353, November 1940, pp. 564-566.
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2993 more students will be served next year than this year. Obviously 
the saturation point has not yet been reached. 
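
The projection above is simple arithmetic and can be retraced in a few lines. The sketch below uses only the figures reported in the text; the variable names are illustrative assumptions.

```python
# Figures reported by the Press survey.
existing_programs = 185
students_served = 7636
programs_planned = 73  # institutions hoping to start a program in the Fall of 1942

# Average load per existing program, rounded to a whole number of students.
average_per_program = round(students_served / existing_programs)

# Additional students expected if each new program carries the average load.
additional_students = programs_planned * average_per_program

print(f"average students per program: {average_per_program}")
print(f"additional students next year: {additional_students}")
```

The estimate rests on the assumption that new programs will serve as many students as established ones, which the survey itself does not test.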


TABLE I.—EXTENT OF REMEDIAL READING SERVICES
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1 The Witty survey does not indicate how many of the questionnaires sent out 
were returned; thus this percentage cannot be compared directly with the results 
of the other two surveys. However, because the number of questionnaires he sent 
was small, it is probable that the per cent of returns was highest in this study; 
if so, and if the sampling institutions covered were representative, the per cent of 
institutions offering remedial reading as indicated by his data may be a more 
reliable estimate than was obtained from the other reports. 

2 Additional replies not used in the survey bring the total to 336 now received:
percentage of replies, 21 plus. The questionnaire was not mailed independently, 
but as an enclosure in a mailing on an entirely different subject. It was not 
mailed first class. No attention was called to the reading questionnaire in the 
primary mailing piece. The response to the primary mailing piece (about reprints 
of seventeen lectures on ‘‘War Comes To America” at $1.00 for the series) was 
2 per cent. The response to the questionnaire was over 21 per cent. If the 
questionnaire had been mailed independently, it might have had an even greater 
response. 


ARE ADDITIONAL MATERIALS NEEDED? 

The Charters and Witty surveys both indicate that the available 
commercial materials are being utilized somewhat, but that there is 
evidence of dissatisfaction with them. It is very probable that lack 
of training and lack of time are the principal factors retarding the 
development of materials appropriate to the conditions. 

Do those who are aware of the need for remedial work feel that a 
new textbook and manual are needed? The manuscripts the Press 
was considering were described and the question was asked, ‘‘ Are you 
interested now in the textbook described? In the exercises and 
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manual?” Of the three hundred five institutions answering the 
questionnaire, seventy-seven per cent (two hundred seventy) said 
they were interested in the text; seventy-six and seven tenths per cent 
(two hundred thirty-four) said they were interested in the manual. 
This would certainly seem to indicate that there is a felt need for such 
materials. In an article summarizing the factors that are deterring 
the development of remedial reading services,' lack of inexpensive 
remedial materials is listed as one of the most important. Both the 
Charters and Witty surveys list the commercial materials frequently 
used and both mention the fact that colleges are beginning to develop 
materials of their own. 

When those who received the Press questionnaire were asked to 
classify the types of materials needed, they answered as follows: Instru- 
ments, seventy-three; tests, one hundred nine; textbooks, one hundred 
fourteen; manuals, one hundred twenty-three; exercises, one hundred 
forty-six. This response would indicate that a textbook is needed 
which will (1) survey the tests that are available and indicate where 
they are useful, (2) describe various diagnostic measures, (3) show 
those in the field how to develop exercises that will meet the needs of 
the students with whom they are working. 


TO WHAT GROUP SHOULD NEW MATERIALS BE ADDRESSED? 


Any publishing house must take care that the materials it puts on 
the market are appropriate in tone and style for the potential users, 
or the call for the materials will be poor. Therefore, the Press not 
only wanted to know the possible number of users, but also who the 
users are likely to be, with what training and what interests. To get 
this information, it asked each dean to indicate the name and depart- 
mental classification of those persons now in charge of remedial work 
in his college. The tabulation of replies showed: 


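[illegible]...................................... 23
[illegible]...................................... 45
Instructors in education......................... 63
[illegible]...................................... 50
[illegible]...................................... [illegible]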
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1 Triggs, Frances Oralind: ‘‘Current Problems in Remedial Reading for College 
Students.” School and Society, Vol. LIII, No. 1369, March 22, 1941, pp. 376-379.
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To determine who particularly feels the need for remedial reading 
programs, the Press asked each dean, “What member of the staff is
primarily interested in the work and to what department does he
belong?” The response was as follows:


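[illegible]...................................... [illegible]
[illegible]...................................... 35
[illegible]...................................... 66
[illegible]...................................... 55
[illegible]...................................... 3
[illegible]...................................... 2
[illegible]...................................... 1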


Thus it can be seen that the interest in this work centers largely in 
the administration and the departments of education, psychology, and 
English. These data suggest that those interested are those who have 
some responsibility for correcting the difficulty. Obviously, then, 
materials should not be written in the jargon of any one specialty but
as simply and clearly as possible, so that they may be generally
understood by the diverse group of those working in the field.


FOR WHAT TYPES OF REMEDIAL SITUATIONS SHOULD MATERIALS BE 
PREPARED? 


Finally, the Press wanted to know what type of work is being 
given. The Charters survey suggested a wide diversification of the 
remedial services. Some were given as a part of established courses: 
How-to-study, English composition, or psychology. Others were 
independent courses, in which case they are often given under the 
auspices of the administration or of instructors in psychology, edu- 
cation, English, or speech. The Press survey bears this out. Of the 
one hundred eighty-five institutions offering remedial work, thirty- 
seven have separate courses seemingly as a part of no departmental 
organization, while thirty-nine have separate courses usually in the 
department of psychology or of education. Forty give their work as a 
part of the how-to-study course, fifty-three as a part of the English 
courses, and three as a part of freshman orientation. One institution 
each classified its work in one of the following departments: Research, 
communication laboratories, speech clinics, methods courses in reading,
and the health service. Some institutions offer help in more than one 
of the classifications mentioned. 
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On this point Witty concludes from his data: ‘‘ Diagnostic and 
remedial [reading] instruction is offered usually by the Department of 
Psychology or of Education. Three schools give help in reading as a 
part of their counselling service; two have reading clinics and give aid in 
special classes. The remainder distribute the work among various 
departments and schools.”1

According to the Press survey five institutions have clinics and 
sixteen do their work with individual cases. These numbers indicate 
the extent to which individual work is done. Witty’s conclusions on 
this point, too, are roughly in agreement. It might be expected that 
for reading difficulties more individual work would be done, and a 
summary of the literature indicates that many people feel individual 
remedial work to be necessary for serious cases. It is probable that 
more individual work is not done because those doing remedial reading 
work at the present time have not had adequate training to handle 
these cases2 and because the load is too heavy now to take care of them
all in that manner. Individual work is expensive. A combination of
clinical and group work promises to be a solution to this problem in
some situations.3

In the Press survey nine individuals expressed special opinions as to 
where the work ought to be done: Two said they would like to see it as 
a part of courses in education or psychology; five said as a part of the 
guidance program; and two as a part of the freshman orientation 
program. One said it was now an extracurriculum activity. Prob- 
ably any institution offering non-credit work would classify it as an 
extracurriculum activity. 

These findings indicate even more explicitly than those of the 
Charters survey that diversification is certainly the rule as far as 
responsibility for organization of the work is concerned. In general, 
however, the work is found in departments responsible for building 
language habits or in those especially interested in, or in teaching, 
techniques for the adjustment of the individual. 

The results of these surveys should be of particular importance to 
institutions that are just beginning or have just begun a remedial read- 
ing service. The need for such a program is evident. According to 





1 Witty, Paul A.: “Practices in Corrective Reading in Colleges and
Universities.” School and Society, Vol. LII, No. 1353, November, 1940, p. 566.

2 See Witty’s survey and Triggs’ article previously cited.

3 Triggs, Frances Oralind: “Remedial Reading.” Journal of Higher
Education, October, 1941, pp. 371-377.
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the Charters survey the worthwhileness of the work is vouched for by 
nearly every institution that has tried it (though admittedly no rigor- 
ous evaluation of such work has yet been made available). It is 
possible to fit the remedial training into various niches now available 
on nearly every campus (how-to-study, English, education, and 
psychology courses, or separate credit or non-credit courses). Witty 
states that of forty-one institutions, twenty-eight do not give credit and 
thirteen do give credit (probably because the work is given as a part of 
a credit course). 

The remedial work can be started as a group program, either as a 
separate course or as a part of a course now in the curriculum. Al- 
though it is recommended that cases of serious disability be handled 
individually, some institutions have found the work to be better fitted 
to the rigorous class schedule if a semi-clinical set-up is maintained, and 
a majority of institutions carry on the remedial reading in small groups. 
Where a specially-trained individual was not available to supervise this 
work, a manual of exercises has usually been followed. Such a manual 
may not be adequate for seriously retarded cases, but it does serve the 
merely inefficient reader. 

From experience in the use of available materials with small groups, 
the untrained instructor will gain some background in remedial tech- 
niques. The inadequacy of prepared materials for the more serious 
cases will undoubtedly become apparent. Then the instructor with 
experience in group remedial reading work and some training in 
psychology or educational psychology can learn to adapt materials to 
all types of individual cases. He can learn this from the periodical 
literature or from textbooks, or through summer or extension courses, 
reading conferences, and consultation with those who have more 
training and experience in the field. 

Some institutions are making progress in developing remedial work, 
and schools planning to offer the service may well learn from their 
experiences. The main obstacles—lack of trained personnel to handle 
the work, lack of adequate diagnostic devices, lack of inexpensive 
remedial materials, and lack of adequate techniques for evaluation of 
the work—will be overcome only by extending the application of 
successful techniques. Such a development takes place slowly, but by 
recording efforts in that direction we can hasten the coming of periodi- 
cal literature and textbooks that will show maturity and thus be of real 
service to all doing work in the field of remedial reading.














A CORRECTION FOR THE EFFECT OF TIED RANKS 
ON THE VALUE OF THE RANK DIFFERENCE 
CORRELATION COEFFICIENT 


DANIEL HORN 
Psychological Clinic, Harvard University 


It is customary in computing the rank difference correlation coeffi- 
cient to deal with the problem of tied scores by giving each a rank equal 
to the average of the ranks they would normally occupy. This paper 
provides an easy method of correcting for the error which results—and 
the error may be surprisingly large—thereby making the corrected 
coefficient comparable to those calculated from rank orders without 
tied ranks. 

DuBois¹ has offered a solution to this problem, but his solution is 
inadequate for two reasons: (1) The computation involved is laborious 
and (2) the resulting coefficient is inaccurate; that is, it does not equal 
the coefficient that would result from applying the product-moment 
correlation formula to the ranks. The correlation coefficient, whether 
product-moment or rank difference, is an inverse function of the 
squared differences between paired measures which have the same 
mean and the same standard deviation. The usual method of handling 
ties equates the means of the two rank orders without equating the 
standard deviations. DuBois’ method gives an approximate equation 
of the standard deviations at the expense of having different means. 
Either method may result in a substantial error. 

The rank difference correlation coefficient, ρ, is computed from the 
formula 

$$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)} \tag{1}$$

where ΣD² represents the sum of the squared differences between 
paired ranks. Formula (1) may be derived by applying the 
product-moment correlation formula to a set of paired variables in which 
the scores on each variable consist of all the integers from 1 to N. In 
other words, two rank orders are correlated by the product-moment 
method, the ranks being treated as though they were numerical scores. 

The simplified form of formula (1) results from the mathematical 
fact that the mean (M) and the variance (σ²) of a series of all the 
integers from 1 to N may be expressed in terms of N, 

$$M = \frac{N + 1}{2} \tag{2}$$

and 

$$\sigma^2 = \frac{N^2 - 1}{12} \tag{3}$$

¹ DuBois, P.: “Formulas and tables for rank correlation.” Psychol. Rec., 
Vol. III, 1939, pp. 46-56. 





If tied ranks occur in the distribution, formula (2) remains exact. 
Formula (3), however, is now too large and must be corrected by a 
factor which depends on the number of sets of tied ranks and the 
number of ranks in each set; thus, the correct value of the variance in 
this case is 

$$\sigma^2 = \frac{N^2 - 1}{12} - \frac{C}{N} \tag{4}$$

where C represents the correction factor due to ties. 
The correction factor, C, for any rank order may be calculated from 
the formula 

$$C = \sum \frac{n(n^2 - 1)}{12} \tag{5}$$


where n represents, successively, the number of ranks in each set of 
ties. Table I gives the value contributed to C by any set of tied 
ranks containing from two to thirty members. 

Thus, for example, if a rank order contains one four-rank tie and 
one two-rank tie, from Table I, C = 5 + .5 = 5.5. If N = 10, the 
variance, by formula (4), is σ² = 8.25 − 5.5/10 = 7.70, instead of 8.25 
as would result from the incorrect use of formula (3). It is noteworthy 
that a single six-rank tie (C = 17.5) introduces a much more serious 
error than three sets of two-rank ties (C = 1.5). 
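
The arithmetic of formulas (4) and (5) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the example rank order are hypothetical, and it assumes that tied scores have already been given their averaged ranks, so that equal values in the list mark a set of ties.

```python
from collections import Counter
from fractions import Fraction


def correction_factor(ranks):
    """Formula (5): sum n(n^2 - 1)/12 over every set of n tied ranks."""
    tie_sizes = Counter(ranks).values()  # an untied rank has n = 1 and adds 0
    return sum(Fraction(n * (n * n - 1), 12) for n in tie_sizes)


def rank_variance(ranks):
    """Formula (4): (N^2 - 1)/12 - C/N, the variance of a rank order with ties."""
    n_total = len(ranks)
    return Fraction(n_total ** 2 - 1, 12) - correction_factor(ranks) / n_total


# Hypothetical rank order (N = 10) with one four-rank tie and one two-rank tie.
example_ranks = [1, 3.5, 3.5, 3.5, 3.5, 6, 7.5, 7.5, 9, 10]

print(float(correction_factor(example_ranks)))  # 5.5
print(float(rank_variance(example_ranks)))      # 7.7
```

Exact fractions are used so that the results can be compared directly with the entries of Table I without rounding.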

Using the notation ρ_c to represent the product-moment correlation 
between ranks allowing for tied ranks in either or both rank orders, 
the correct formula is 

$$\rho_c = \frac{N(N^2 - 1) - 6(C_x + C_y) - 6\sum D^2}{\sqrt{N(N^2 - 1) - 12C_x}\,\sqrt{N(N^2 - 1) - 12C_y}} \tag{6}$$

where C_x and C_y are the correction factors for the X and Y rank orders, 
respectively. If there are no ties, C_x = C_y = 0, and formula (6) 
reduces to formula (1). 

Although this formula is mathematically exact, it is of doubtful 
practical value. The chief charm of the rank difference correlation 
coefficient is the simplicity and ease with which it may be calculated 
from small samples. 

Fortunately, however, the denominator of this complicated formula 
may be replaced by an approximate value, [N(N² − 1) − 6(C_x + C_y)], 
without distorting the final result more than .01 in practically all cases 
in which it might be used. This approximation may be made since the 
square root of the product of two numbers (the geometric mean) is 
almost identical with (slightly smaller than) the arithmetic mean 
of the two numbers if the two numbers do not differ by much. Thus 

$$\sqrt{A} \cdot \sqrt{B} \approx \frac{A + B}{2},$$

an approximation which improves as A and B approach each other in 
value. For example, √900 · √800 ≈ 850, which is very close to the 
actual value of 848.53. Since in formula (6), N(N² − 1) is almost 
always very large compared to 12C, the approximation is a very close 
one unless the correction in one rank order is very large and the 
correction in the other rank order is very small, that is, unless 
(C_x − C_y) is large. If half of the ranks in one distribution are tied 
at one position and the other half are tied at another position, and 
there are no ties at all in the other rank order (a much more extreme 
case than is usually met with in experimental data), the error in this 
approximation is only one per cent, that is, less than .01 in 
absolute value. 


TABLE I.—THE AMOUNT CONTRIBUTED TO THE CORRECTION FACTOR 
C = Σ n(n² − 1)/12 BY EACH n-RANK SET OF TIES IN A RANK 
ORDER, FOR n = 2 TO n = 30 

 n     n(n² − 1)/12        n     n(n² − 1)/12 

 2          .5            16        340 
 3         2              17        408 
 4         5              18        484.5 
 5        10              19        570 
 6        17.5            20        665 
 7        28              21        770 
 8        42              22        885.5 
 9        60              23       1012 
10        82.5            24       1150 
11       110              25       1300 
12       143              26       1462.5 
13       182              27       1638 
14       227.5            28       1827 
15       280              29       2030 
                          30       2247.5 
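
The closeness of the geometric-mean approximation used for the denominator of formula (6) can be checked numerically. The following Python sketch is offered only as an illustration; the function name is hypothetical, and the extreme case is taken, as an assumption, to be N = 10 with two five-rank ties in one rank order and no ties in the other.

```python
import math


def denominators(n_total, c_x, c_y):
    """Return the exact denominator of formula (6) and its approximation."""
    base = n_total * (n_total ** 2 - 1)
    exact = math.sqrt(base - 12 * c_x) * math.sqrt(base - 12 * c_y)
    approx = base - 6 * (c_x + c_y)  # arithmetic mean of the two radicands
    return exact, approx


# The numerical illustration given in the text.
print(round(math.sqrt(900) * math.sqrt(800), 2), (900 + 800) / 2)

# Assumed extreme case: N = 10, two five-rank ties in X (C_x = 10 + 10), none in Y.
exact, approx = denominators(10, 20, 0)
print(round(exact, 2), approx, round(approx / exact - 1, 4))
```

In this assumed case the approximate denominator exceeds the exact one by a little under one per cent.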

Using this approximation, formula (6) reduces to 

$$\rho_c = 1 - \frac{\sum D^2}{\dfrac{N(N^2 - 1)}{6} - (C_x + C_y)} \tag{7}$$

which may be compared with formula (1), rewritten as 

$$\rho = 1 - \frac{\sum D^2}{\dfrac{N(N^2 - 1)}{6}} \tag{8}$$


The only difference between formulas (7) and (8) is that the sum of 
the two corrections is subtracted from the denominator of the second 
term, to give the more accurate result. 

The following example illustrates the use of formula (7), comparing 
it with formula (8): 








X                  Y                   D = DIFFERENCE 
RANK ORDER         RANK ORDER          BETWEEN RANKS        D² 
  1                  7.5                  6.5              42.25 
  3.5               10                    6.5              42.25 
  3.5                7.5                  4.0              16.00 
  3.5                5                    1.5               2.25 
  3.5                7.5                  4.0              16.00 
  6                  2.5                  3.5              12.25 
  7                  7.5                   .5                .25 
  8                  2.5                  5.5              30.25 
  9                  2.5                  6.5              42.25 
 10                  2.5                  7.5              56.25 
(One 4-rank tie)   (Two 4-rank ties) 
C_x = 5            C_y = 5 + 5 = 10                  ΣD² = 260.00 
C_x + C_y = 15 
N = 10; N(N² − 1)/6 = 165 

By the usual formula (8), ρ = 1 − 260/165 = −.576; by formula 
(7), ρ_c = 1 − 260/(165 − 15) = 1 − 260/150 = −.733; the error in 
formula (8) is (−.576) − (−.733) = +.157. 

It is interesting to note that if one of the two rank orders is reversed, 
ΣD² = 40.00, and by the usual formula ρ = 1 − 40/165 = +.758, 
whereas the corrected ρ_c = 1 − 40/150 = +.733 (note that C_x + C_y 
is still the same). If there are no ties and one rank order is reversed, 
the resulting ρ is the exact negative of the first value. This test of 
reversal holds for formula (7) when there are ties as well, but not for 
formula (8). If the product-moment formula is applied to these ranks, 
r = −.734, which varies by .001 from the result of formula (7). 
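
The worked example above can be reproduced with a short Python sketch. It is meant only as an illustration of formulas (6), (7) and (8): the function names are hypothetical, and the code assumes that the two lists already hold averaged ranks, so that equal values within a list mark a set of ties.

```python
import math
from collections import Counter


def tie_correction(ranks):
    """Formula (5): sum n(n^2 - 1)/12 over each set of n tied ranks."""
    return sum(n * (n * n - 1) / 12 for n in Counter(ranks).values())


def sum_squared_differences(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def rho_usual(x, y):
    """Formula (8): the usual coefficient, ignoring ties."""
    n = len(x)
    return 1 - sum_squared_differences(x, y) / (n * (n * n - 1) / 6)


def rho_corrected(x, y):
    """Formula (7): the sum of the two corrections is taken from the denominator."""
    n = len(x)
    c_sum = tie_correction(x) + tie_correction(y)
    return 1 - sum_squared_differences(x, y) / (n * (n * n - 1) / 6 - c_sum)


def rho_exact(x, y):
    """Formula (6): the product-moment correlation between the ranks."""
    n = len(x)
    base = n * (n * n - 1)
    c_x, c_y = tie_correction(x), tie_correction(y)
    numerator = base - 6 * (c_x + c_y) - 6 * sum_squared_differences(x, y)
    return numerator / (math.sqrt(base - 12 * c_x) * math.sqrt(base - 12 * c_y))


x = [1, 3.5, 3.5, 3.5, 3.5, 6, 7, 8, 9, 10]
y = [7.5, 10, 7.5, 5, 7.5, 2.5, 7.5, 2.5, 2.5, 2.5]
x_reversed = [len(x) + 1 - rank for rank in x]  # reversal of the X rank order

print(round(rho_usual(x, y), 3))               # -0.576
print(round(rho_corrected(x, y), 3))           # -0.733
print(round(rho_exact(x, y), 3))               # -0.734
print(round(rho_usual(x_reversed, y), 3))      # 0.758
print(round(rho_corrected(x_reversed, y), 3))  # 0.733
```

The last two lines show the test of reversal: formula (7) changes sign only, while formula (8) gives values of different size in the two directions.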


CONCLUSIONS 


A comparison of formula (8) with formula (7) leads to the following 
conclusions: 

(1) The error in the usual rank difference correlation formula 
caused by tied ranks is always positive, that is, positive coefficients 
are always too large and negative coefficients are always numerically 
too small. Also, the error increases in proportion to ΣD², that is, 
increases as ρ goes from the positive end of the correlation scale to the 
negative end of the scale. Both points are illustrated in the example 
above, for when ρ_c = +.733, then ρ = +.758, a positive error in the 
latter of +.025; and when ρ_c = −.733, then ρ = −.576, again a 
positive error, but a much larger one of +.157. 

(2) In applying this method for correcting the rank difference 
coefficient for ties in the rank orders, it has been found most efficient 
to determine the value of C for each rank order when making up the 
ranks, and to place the numerical value of C below the rank order 
whenever it is used. 

(3) The resulting coefficient ρ_c, as given by formula (7), where 
C_x and C_y are obtained from Table I, is directly comparable with the 
usual rank difference coefficient calculated from rank orders without 
ties, since it represents the product-moment correlation formula 
applied to a set of ranks. 

(4) This method makes possible the correct use of rank difference 
correlation, even when many ties are present, with a negligible amount 
of extra labor. There should no longer be any reason for the calcu- 
lation of rank difference coefficients as much as .30 or .40 in error, as 
is now often the case. 








VERBAL TEST MATERIAL INDEPENDENT OF SPECIAL 
VOCABULARY DIFFICULTY 


D. O. HEBB 


Queen’s University * 


This paper describes a further revision of the Analogies tests (Van 
Wagenen®) as used by Weisenburg, Roe and McBride,* and some new 
but unstandardized sentence completion tests. The main value of 
the material is in meeting a need for verbal tests whose upper sensi- 
tivity is due to something else than vocabulary difficulty (Hebb and 
Morton²). Such tests are rare. A need for them is not likely to be 
felt by one who has used only composite tests in which the factors 
affecting total score are inextricably mixed up, but it can be shown to 
exist theoretically at least. It is a commonly recognized principle of 
testing that a rating of intelligence should be based on diversified 
tasks, and this is not possible if all verbal tasks involve the use of rare 
words (rare, that is, for any substantial part of the population, whether 
for educated persons or not). Also, if one has ever worked with a 
representative group of adults, using verbal tests which minimize the 
importance of a large vocabulary, one is likely to have encountered 
the occasional subject with a poor cultural background, with moderate 
to poor vocabulary, but also capable of making reasonably high scores 
in other verbal tasks. 

Contrast the Army Alpha analogies, “Historian—facts :: novelist— . . . ” 
and “Yes—no :: affirmative— . . . ” with Van Wagenen’s 
“Picture—frame :: field— . . . ” and “Man—legs :: carriage— . . . ” 
For many purposes the Army Alpha material is excellent, 
and as a single instrument by which to rate intelligence the Kelley- 
Trabue Language Completion test is probably far superior to the 
sentence completion test to be described here. Like Army Alpha, 
however, the Kelley-Trabue has more than a suspicion of literary and 
vocabulary difficulty in such items as the following: “. . . things are 
. . . satisfying to an ordinary . . . than congenial friends,” and it is 
also very difficult to score. There are occasions when one might wish 
to know how an uneducated man would compare with others when his 
vocabulary does not enter into the picture. A recent report (Elwood¹) 
gives a correlation of .978 between vocabulary and Stanford-Binet 
mental age. Even though the size of the correlation is partly due to 
the inclusion of a number of age levels in one group, this finding might 
be used to show that vocabulary is the central fact of intelligence. 
Such a conclusion would fit nicely into a Watsonian theory of thought 
if it were not that the correlation is essentially spurious. The 
Stanford-Binet is loaded throughout with vocabulary difficulty. 

* The assistance of N. W. Morton in the collection of data for standardization 
of the “Fourth Word Series” is gratefully acknowledged. 

It seems worth while, for research purposes at least, to provide 
verbal material whose difficulty is independent of the terminology of 
the test. This is the writer’s only excuse for meddling with an 
already good analogies test, in an attempt to extend its upper sen- 
sitivity somewhat, and for reporting sentence completion material 
without standardization data. 


“FOURTH WORD SERIES” (ANALOGIES TEST) 


The revision of the analogies material is referred to here, and in 
clinical use, as the “Fourth Word Series,” in line with a policy of naming 
tests for adult use so that the subjects will neither smell out the 
presence of an intelligence test from its title (which may put an end to 
testing), nor obtain from it false ideas as to what they are supposed to 
do in the test. The title of a test should not supplement the formal 
instructions unless one is sure that it is equally informative to sub- 
jects of various backgrounds. * 

Preliminary experiments with the Weisenburg, Roe and McBride 
analogies were done with the aid of several students, first, to determine 
the suitability of the “Printed Analogies” as an alternate oral form. 
With an apparently representative group of adults, mean scores which 
were practically identical were obtained when both tests were used 
orally (correlation .89). Great difficulty was experienced with the 
instructions to the subject as given by the authors, and a new procedure 
was worked out. It is complex to describe, but very simple in practice. 
The subject is told that the examiner will say three words, and that the 
subject is to supply a fourth word “that goes with the first three.” 
“Like this: Man, big; baby, ?” The pause after the second 
word is longer than that between the first and second; and there is a 
falling inflection on the second word (as if completing a sentence), 
but not on the third (as with an incomplete sentence). If the subject, 
as he often does, fails to supply the word “small” or “little,” the 
examiner himself does so, almost immediately, saying then “A man 
is big, a baby is—small,” and then goes on to another example. 
Rarely is it necessary to do this with more than two examples. The 
same presentation is used for test items. The subject is given only 
the three terms of the analogy, “a, b; c, ?” instead of the formula, 
“a is to b as c is to —?” 

* Just such a fault in the name of the “McGill Picture Anomaly Series” must 
be acknowledged. The word “anomaly” would have been better omitted. 





TABLE I.—MEAN SCORES ON FOURTH WORD SERIES A BY AGE 
GROUPS, TOGETHER WITH MEAN SCORES FOR ONE LIST OF 
THE STANFORD-BINET VOCABULARY TEST (OLD FORM) 

                                                         20-66: ALL 
AGE                       20-29   30-39   40-49   54-66   SUBJECTS 
Number of cases            15      14      11       9       49 
Vocabulary                 25.1    31.5    26.8    26.6     27.6 
Fourth Word Series A       17.7    21.4    15.9    11.7     17.7* 

* Standard deviation, form A, 6.9; form B, mean 17.2, SD 6.6. 


Attention was then given to the question of finding more difficult 
items for the upper end of the scale, the need of which is noted by 
Weisenburg, Roe and McBride. About thirty-five items were pre- 
pared and tried out thoroughly with college populations. Twenty 
were retained, divided into two lists of ten each, matched for type of 
relation and for difficulty. Weisenburg, Roe and McBride’s two lists 
of thirty-five were then pruned of some easier or less discriminating 
items, and of two or three items which possibly involved vocabulary 
difficulty. Some re-arrangement of items was made, to balance, for 
example, the number of anatomical items in the two lists (there is an 
excess of these in the “Oral Analogies”). The result was two lists of 
forty items, each made up of thirty from the original and ten new and 
relatively difficult items. Both lists were given to fifty subjects of the 
group used in standardizing the Verbal Situation Series described 
elsewhere (Hebb and Morton³). From the results, two final lists of 
thirty items each were prepared, the “Fourth Word Series” A and B. 
The test-retest product-moment reliability for the total group of fifty 
subjects is .92; for thirty subjects aged eighteen to thirty-nine, .93. 
The group as a whole appears reasonably representative of the general 
population in average level and degree of heterogeneity of ability as 
judged by educational and vocabulary-score indices, and is also 
predominantly an unsophisticated group. It is unsatisfactory, how- 
ever, in its representation of specific age groups, particularly those 
over forty. Table I presents the actual data by age groups, together 
with Stanford-Binet vocabulary scores as a rough check on the 
representativeness of the age groups. Since the subjects in the fifties 
and sixties actually fell very nearly within a ten-year age span, these 
have been combined in the table into one group, aged fifty-four to 
sixty-six; and one subject aged eighteen is omitted from the table. 

The range of scores for Series A and B, respectively, is 2 to 30, and 
2 to 29. The last item in each series is made up of three parts—three 
separate analogies—success on any part of which is sufficient for success 
on the whole item. These six items were found to be of disproportion- 
ate difficulty, and even to succeed with one in three is more difficult 
than to succeed with the single analogy of the preceding item. 

In order to give a clearer idea of the test material, the last eight 
items of Series A are reproduced here; Numbers 23 and 27 are from 
Weisenburg, Roe and McBride; the three items numbered “30” are 
those of disproportionate difficulty mentioned above. All items 
retained here, however, have a satisfactory degree of validity as esti- 
mated from percentage of successes by those making high and low 
scores on the test as a whole. 


Test items: 23. Grain wheat fruit—(any kind of fruit) 
24. Marry preacher arrest—(policeman, etc.) 
25. Dead bury naked—(clothe, dress, etc.) 
26. Leapyear four Sunday—(seven) 

27. Picture frame field—(fence, hedge) 
28. Ice cool fire—(warm, lukewarm) 
29. Heat ashes woodwork—(shavings, sawdust, chips) 
30. [Success on one of] 
Frost plant rifle—(man, animal, etc.) 
Thunder quiet lightning—(dark, etc.) 
Land lake sea—(island). 


These last items are in general quite difficult enough for a college 
population. If it should be desired to use such material for a high- 
school or college population, it would probably be best to make a single 
list of the items from both form A and form B numbered 17 or there- 
abouts to 30, and separating the several analogies of item 30 in each 
case into distinct items. In this way there would be obtained a single 
more difficult series of thirty or thirty-two items, with a fairly good 
degree of discrimination even with such a select population. 

Instructions and test material have been mimeographed and may 
be obtained from the writer. 


SENTENCE COMPLETION MATERIAL 


Since this material is not standardized, it will be described briefly 
here. The test was prepared to provide enough items for a reliable 
examination and similar in content to the completion items of the new 
Stanford-Binet; the kind of item is attributed by Terman & Merrill⁴ to 
Minkus, and consists of relatively short sentences with “connecting 
words” (adverbs, prepositions and conjunctions) to be supplied by the 
subject. The resultant test is complete, with the exception of obtaining 
norms for it in its final shape. The material was thoroughly tried 
out in a number of revisions, with several item analyses, and with a 
rating experiment using forty-eight judges to determine the relative 
value of the various answers for each problem. The test in its present 
form has probably not as great a reliability as the “Fourth Word 
Series,” but judging from earlier data it may be expected to be over .85. 

It may be worth mentioning here that it has been found feasible, 
although a little tedious, to use the test orally. Having encountered 
in clinical work a fairly high proportion of patients with visual defects, 
the writer was impressed with the need of tests which are independent 
of vision. Accordingly a great part of the preliminary work with this 
material was actually done by reading the sentence to the subject, 
with a pause and a pencil-tap in place of the missing word (in case it 
was less confusing to the subject, the word “blank” was inserted in 
place of the missing word). The process of giving the test was a slow 
one, but there was no difficulty in having the subject understand what 
he was to do, and the material was on the whole accepted well by public- 
ward patients. 

The following are examples from the upper end of the series, to show 
that difficulty has been achieved without either very elaborate sen- 
tences or a refined phraseology: 


Test items: 
A19. You are not . . . as tall, but taller. (just, merely, etc.) 
A20. The food is good . . . it is. (as, whatever) 
A23. No one knew . . . now what he would do. (until) 
A25. . . . after could he see or speak. (never) 
B21. . . . children older people should speak good English. 
(before, around) 
B24. It’s bad, but what . . . is possible? (else, more) 
B25. He will shoot to kill . . . happens afterward. (whatever) 

Instructions and test material have been mimeographed and may 
be obtained from the writer. 


SUMMARY 


Two tests are described, one a revision and restandardization of 
Weisenburg and McBride’s revision of the Van Wagenen analogies, 
the other an unstandardized “Minkus-type” sentence completion test. 
The tests are designed to present an adult range of difficulty 
independent of vocabulary difficulty. 




Undoubtedly many students who have from time to time administered 
and corrected standardized or non-standardized tests have 
observed the arrangement in which the errors or unsolved problems 
appear to group themselves. For some time the writer has noticed 
this peculiar arrangement of errors: That the errors or unsolved prob- 
lems in the tests seem to fall into small groups, clusters or constella- 
tions. For example, a pupil may answer correctly ten to a dozen 
questions and then omit or fail to solve from three to five. After this, 
in most cases he will again answer or solve a number correctly. 

At first this peculiar grouping or arrangement was accepted as mere 
chance or coincidence, but finally the question as to why these errors 
should group themselves in this fashion was raised. It hardly seemed 
possible that all the most difficult questions should group themselves 
into clusters in the non-standardized tests. 

After some speculation as to the probable cause of such arrange- 
ment, the writer administered an objective intelligence test of his own 
preparation consisting of Forms A and B. In this test there was no 
attempt to arrange the items in the order of difficulty, and the test 
required no reading on the part of the pupil. 

After the test had been administered to about five hundred pupils 
and the responses upon the answer cards had been carefully corrected 
and scored, the distribution and grouping of errors were evident. 

An attempt was then made to arrange these items of the test on the 
basis of difficulty as measured by the number of pupils who had missed 
or failed to solve each question. It was assumed that an item which 
had been missed or left unsolved by a large number of pupils should 
be rated more difficult than one that had been missed by a small 
number. Upon this assumption the questions of the test were 
rearranged: The question missed by the smallest number of pupils 
was listed as Question No. 1 while the question missed by the largest 
number became Question No. 50. The initial and rearranged order 
of the questions is shown in Table I. 
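
The rearrangement described above amounts to sorting the questions by the number of pupils who missed each one. A minimal Python sketch follows; the function name and the miss counts are hypothetical (the per-question counts are not reported here), and breaking ties by the initial order is an assumption of the sketch, not a rule stated by the writer.

```python
def rearrange_by_difficulty(miss_counts):
    """Return the initial question numbers ordered from easiest to hardest.

    miss_counts maps each initial question number to the number of pupils
    who missed or failed to solve it. Ties are broken by initial order,
    which is an assumption made only for this sketch.
    """
    return sorted(miss_counts, key=lambda question: (miss_counts[question], question))


# Hypothetical miss counts for a five-question test given to 500 pupils.
miss_counts = {1: 12, 2: 140, 3: 305, 4: 160, 5: 298}

rearranged = rearrange_by_difficulty(miss_counts)
difficulty_rank = {question: rank for rank, question in enumerate(rearranged, start=1)}

print(rearranged)       # [1, 2, 4, 5, 3]
print(difficulty_rank)  # {1: 1, 2: 2, 4: 3, 5: 4, 3: 5}
```

The resulting mapping from initial order to difficulty rank plays the same role as Table I for the full fifty-question test.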

From this rearrangement it is possible to ascertain whether the 
groups or constellations of questions or problems missed by the several 
pupils were of equal difficulty or not. This likewise answered the 
question whether the difficult questions had by chance been arranged 
in groups. 


TABLE I 

REARRANGED    INITIAL      REARRANGED    INITIAL 
ORDER         ORDER        ORDER         ORDER 
    1            1            26           30 
    2           36            27           22 
    3           43            28            8 
    4           37            29           18 
    5           12            30           33 
    6           47            31           24 
    7           42            32           28 
    8            2            33           25 
    9           39            34           21 
   10           50            35            9 
   11           15            36           45 
   12            4            37           23 
   13           29            38           49 
   14           38            39           32 
   15           26            40           31 
   16           44            41           41 
   17           34            42            5 
   18           48            43            3 
   19           14            44            6 
   20           16            45           10 
   21           40            46           13 
   22           35            47           20 
   23           17            48           27 
   24           19            49           46 
   25           11            50            7 


A comparison of the response or answer cards with the list of 
rearranged problems showed that about two-thirds of the total number 
of pupils, sixty-six per cent, after having solved a number of problems, 
failed to solve problems whose difficulty rank was about twenty-eighth 
on the scale, and then continued to miss or fail on problems as much as 
twenty steps of difficulty below; in fact, they missed problems much 
lower in degree of difficulty than those previously solved. It is well, 
however, to remember that these steps of difficulty are not necessarily 
equal. 

Tables II and III show the average slump or lowering in 
problem-solving ability of two typical groups. 


TABLE II.—SHOWING PROBLEM-SOLVING SLUMPS OF GROUP A 
Number of Pupils = 32 

           DEGREE OF                                              DEGREE OF 
           DIFFICULTY                   LOWEST                    DIFFICULTY 
           AT WHICH       NUMBER OF     DEGREE OF     RANGE OF    WHERE 
PUPIL      SLUMP          PROBLEMS      DIFFICULTY    FALL OR     RISE 
NUMBER     BEGAN          IN SLUMP      OF SLUMP      SLUMP       BEGAN 

   1          39              3            15            24           4 
   2          48              4            14            34          35 
   3          50              3             4            46          29 
   4          11              4             8             3          18 
   5          39              3            15            24           4 
   6          12              4             2            10          39 
   7          11              4             8             3          18 
   8          12              4             2            10          39 
   9          30              3             8            22          18 
  10          42              5             2            40           4 
  11          12              4             2            10          39 
  12          39              3            15            24           4 
  13          12              5             2            10          50 
  14          12              7             2            10           4 
  15          43              5            12            31           2 
  16          12              7             2            10           4 
  17          12              4             2            10          39 
  18          27              3             7            20          20 
  19          43              5            12            31           2 
  20          36              6            12            24           2 
  21          36              9             2            34          15 
  22          43              7             2            41          50 
Average       28              4.6           6.8          21          20 


Tables II and III show that the average difficulty position of the 
problem which began the group of errors was the twenty-eighth, 
which lies in the third quartile of difficulty. The average 
low ebb of the slumps for these groups is in the seventh (6.8) position 
of difficulty and in the first quartile. Twenty-one per cent of the 
pupils actually showed drops or slumps of half the scale of difficulty; 
sixteen per cent showed drops of less than half the scale of difficulty 
distance but more than one-third of it. 


TABLE III.—SHOWING PROBLEM-SOLVING SLUMPS OF GROUP B 
Number of Pupils = 24 

           DEGREE OF                                              DEGREE OF 
           DIFFICULTY                   LOWEST                    DIFFICULTY 
           AT WHICH       NUMBER OF     DEGREE OF     RANGE OF    WHERE 
PUPIL      SLUMP          PROBLEMS      DIFFICULTY    FALL OR     RISE 
NUMBER     BEGAN          IN SLUMP      OF SLUMP      SLUMP       BEGAN 

   1          30              4             8            22          33 
   2          17              3            11             6          30 
   3          47              5             2            45          15 
   4          12              4             2            10          39 
   5          48              4            14            34          35 
   6          47              5             2            45          15 
   7          12              5             2            10          50 
   8          22              3             8            14          33 
   9          12              7             2            10           4 
  10          43              8             2            41          15 
  11          30              4             8            22          33 
  12          12              6             2            10          15 
  13          34              3            14            20          16 
  14          43              3            12            31          47 
  15          12              6             2            10          15 
Average       28              4.6           6            22          26 


As further verification of these groupings of errors the writer 
checked the results of the Otis Intelligence Test, Form A of ninety-two 
freshmen and found the following: Thirty-four per cent of the total 
group showed slumps of three or more questions. Of these, forty-five 
per cent encountered their first slumps in the first quartile of difficulty; 
thirty-two per cent in the second and twenty-three per cent met them 
in the third. 

The average number of questions missed during these slumps was 
4.2. These slumps or constellations of errors did not only occur in the 
responses of the pupils who scored below the average in the test, but 
also among those who scored above. To be exact, forty-five per cent 
of the pupils who scored above the average showed such slumps, while 
fifty-five per cent of the pupils who scored below did so. 

These tables also show that after these slumps or drops, in which 
several easy items were missed or incorrectly solved, these same pupils 
again began to solve correctly problems of higher order of difficulty, 
just as the athlete gains “second wind” or stages a comeback. 

Just why this “crest and trough” or “rise and slump” habit should 
manifest itself in so many cases raises some questions: Is it possible 
that intellectual power flows or is generated in wave-like streams with 
crests or high points and troughs or low points, and that problems 
high on the scale of difficulty may be solved when the intellectual power 
is at its crest and problems lower on the scale are missed when this is 
at the ebb stage? Or are these slumps simply periods of temporary 
or momentary fatigue or exhaustion? Or might it be possible that 
the intellectual energy consumed in attempting the solution of the 
difficult problem produces a drop in potential in the intellectual energy 
immediately following, just as electrical potential is lowered when a 
machine is forced or overloaded? Or does the difficulty of the problem 
produce a drop in the level of aspiration for problem-solving which 
causes problems lower on the scale of difficulty to be missed or incor- 
rectly solved? 

It is believed that we have cycles of aspiration¹ and depression of 
weeks or days. Might we likewise have such cycles in lesser degree at 
shorter intervals during the process of thinking and problem solving? 





¹ Pennington, L. A.: “Shifts in Aspiration Level After Success and Failure in 
College Classroom.” The Journal of General Psychology, Vol. xxi, 1940, pp. 
305-313. 














AN IMPROVED SELF-MARKING ANSWER SHEET 


RICHARD WALLEN 
Department of Psychology 


AND 
GEORGE RIEVESCHL, JR. 
Department of Chemical Engineering, University of Cincinnati 


In order to facilitate the scoring of objective tests a number of 
techniques have been devised which eliminate several of the operations 
necessary in ordinary manual scoring. Few of them have been widely 
adopted because their advantages do not offset the accompanying 
expense and inconvenience. The method here proposed will, we 
believe, eliminate many of the disadvantages of earlier methods while 
retaining the time and energy economies afforded by them. 

Among the requirements for a useful self-marking answer sheet 
are the following: 


(1) Expensive machinery should be unnecessary. 

(2) Cumbersome answer pads and folders should be avoided. 

(3) Special equipment for marking answers should not be 
needed. 

(4) Error identification must be definite and clear. 

(5) Provision should be made for respondents to change their 
answers. 

(6) Provision should be made for independent checking of 
errors. 


Few, if any, self-scoring devices invented to date fulfill all these 
demands. I.B.M. equipment,² although useful in large projects, is 
too expensive where fewer than two hundred tests are to be scored. 
In addition, special pencils must be supplied for marking answers. 
The method developed by Toops⁴ entails the inconvenience of styli 
and cumbersome answer pads. In his method, as in that of Clapp 
and Young,¹ there is some difficulty in scoring changes of answers, 
especially when the respondent makes a correct choice, erases it, and 
finally decides to mark it. Carbon sheet devices of the Clapp-Young 
type also necessitate a backing affixed to the answer sheet in order 
to conceal the key, and thus expense and inconvenience are added. 
Although the Peterson³ answer blank is a single sheet, users must go 
to the trouble of supplying water and brushes to respondents. 

It has long been known that there are a great many chemical 
compounds which are colorless in ordinary light but appear brightly 
colored when viewed under ultra-violet light. This principle of 
ultra-violet fluorescence is the basis of our method. The answer 
sheet is so designed that, following the number of the item, five small 
circles are printed. Each one corresponds to one of the five possible 
choices offered in the item, and the center of that circle representing 
the correct choice is impregnated with a minute amount of an invisible 
fluorescent substance. The respondent is instructed to blacken with 
his lead pencil the entire center of the circle which corresponds to 
the answer he has selected. When the answer sheet, marked according 
to these instructions, is viewed by ultra-violet light, errors (i.e., 
correct choices that have not been marked) stand out as vivid spots 
of light. The right responses, however, show little, if any, fluorescent 
brilliance, since the impregnated regions have been covered by a 
relatively opaque layer of graphite. In scoring one needs only to 
count the bright spots (errors) appearing on the answer sheet. 
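
The counting rule just described can be put as a short sketch. It is only an illustration under stated assumptions: the function name bright_spots, the four-item key, and the sample markings are hypothetical and are not part of the method as we have used it. A bright spot is counted for every item whose keyed circle has not been blackened, whether the respondent marked a wrong circle or omitted the item.

```python
from typing import Dict, Set


def bright_spots(key: Dict[int, int], marked: Dict[int, Set[int]]) -> int:
    """Count items whose keyed (impregnated) circle was left unblackened.

    key    maps item number -> number of the correct circle (1 to 5).
    marked maps item number -> set of circles the respondent blackened.
    Both structures are hypothetical stand-ins for a paper answer sheet.
    """
    return sum(
        1
        for item, correct in key.items()
        if correct not in marked.get(item, set())
    )


# Hypothetical four-item sheet: item 2 is answered wrongly, item 4 is omitted.
key = {1: 3, 2: 1, 3: 5, 4: 2}
marked = {1: {3}, 2: {4}, 3: {5}}

errors = bright_spots(key, marked)   # spots that would glow under the lamp
rights = len(key) - errors           # number right = items minus errors
print(errors, rights)
```

The sketch models only what is stated above: the scorer counts glowing spots, and the number right follows by subtraction from the number of items.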

The bulb which we use is a General Electric Purple-X, which has a 
life of fifty hours at one hundred fifteen volts if used intermittently. 
This lamp is obtainable for $1.25 and has a standard base for mounting 
in 110-volt sockets. It is placed in a desk light and adjusted to a 
height of ten to twelve inches above the papers. Although the room 
in which the grading is done need not be completely darkened, con- 
siderably less than normal illumination is desirable. Under such 
conditions the scoring time for forty answer sheets of thirty-five items 
is about six to eight minutes. If an independent check of the score 
is desired, it is possible to make a scoring stencil and check the answers 
in ordinary light. The time required for checking with a stencil 
is approximately double that needed for fluorescent scoring. 
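
The figures in this paragraph can be reduced to time per sheet by simple arithmetic, sketched below. The constant and function names are hypothetical. The estimate of sheets per bulb rests on an assumption not tested here, namely that the lamp burns only while sheets are being scored and lasts the full fifty hours.

```python
SHEETS_PER_BATCH = 40           # answer sheets scored in one sitting
FLUORESCENT_MINUTES = (6, 8)    # reported range for scoring one batch
STENCIL_FACTOR = 2              # stencil checking takes about twice as long
BULB_LIFE_HOURS = 50            # stated life under intermittent use


def seconds_per_sheet(batch_minutes: float, sheets: int = SHEETS_PER_BATCH) -> float:
    """Average scoring time per sheet, in seconds."""
    return batch_minutes * 60 / sheets


def sheets_per_bulb(batch_minutes: float, sheets: int = SHEETS_PER_BATCH) -> float:
    """Sheets scored in one bulb life, assuming the lamp is lit only while scoring."""
    return BULB_LIFE_HOURS * 60 / batch_minutes * sheets


for minutes in FLUORESCENT_MINUTES:
    fluorescent = seconds_per_sheet(minutes)
    stencil = fluorescent * STENCIL_FACTOR
    print(
        f"{minutes} min per batch: {fluorescent:.0f} s per sheet under the lamp, "
        f"about {stencil:.0f} s with a stencil, "
        f"roughly {sheets_per_bulb(minutes):.0f} sheets per bulb"
    )
```

On these figures a sheet takes nine to twelve seconds under the lamp and about eighteen to twenty-four seconds with a stencil.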

Obviously certain precautions must be observed if the best results 
are to be obtained. A pencil with medium or soft lead will “black 
out” the fluorescent spots more effectively than will one with hard 
lead. In order to produce maximum discrimination in scoring, 
students must be cautioned to blacken the entire area within the small 
circle. Erasures are allowed, but they must not be so vigorously 
made as to damage the surface of the sheet. Since the fluorescent 
chemical penetrates the paper, erasing will not remove it, although 
it is dimmed under ultra-violet light when the surface is badly scratched. 
In practice we have found that these precautions are easily observed 
and do not cause serious inconvenience. 

A number of colorless fluorescent substances may be used in making 
the answer sheets. The hydrocarbons, anthracene and fluorene, and 
derivatives of umbelliferone were found most satisfactory. These 
materials appear bright blue-white under ultra-violet light. In 
applying them to the paper a solvent must be employed which will 
not alter the appearance of the surface on drying. The commercially 
available “Cellosolve” is satisfactory for this purpose. It is impossible 
even for one who knows the process to discover the correct 
answers by viewing the sheet in ordinary light. 

One difficult problem was to develop a method for impregnating 
the answer sheets rapidly and accurately. So far only manual methods 
have yielded the necessary accuracy, but there is hope that production 
of the sheets may be speeded. 

The method of self-scoring we have suggested requires no equip- 
ment other than an ultra-violet bulb; it avoids the need for special 
styli and pencils and eliminates cumbersome answer pads. Further, 
it permits respondents to erase right or wrong answers without seri- 
ously impairing the discrimination of errors. 